Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

a large number of task/position found in the etcd cause the cdc cannot work. #7131

Closed
3AceShowHand opened this issue Sep 20, 2022 · 1 comment · Fixed by #7216
Closed

a large number of task/position found in the etcd cause the cdc cannot work. #7131

3AceShowHand opened this issue Sep 20, 2022 · 1 comment · Fixed by #7216
Assignees
Labels
affects-6.1 area/ticdc Issues or PRs related to TiCDC. severity/moderate type/bug The issue is confirmed as a bug.

Comments

@3AceShowHand
Copy link
Contributor

What did you do?

the capture panic and crashed frequently for a long period.

What did you expect to see?

after the panic was fixed, the cdc server should works normally.

What did you see instead?

[2022/09/20 16:23:35.659 +08:00] [WARN] [batch.go:57] ["etcd txn ops exceed the limit"] [key=/tidb/cdc/default/default/task/position/3941cc8d-0f46-4a75-941f-a817d68a7bad/canal-json-kafka-test] [value=]
[2022/09/20 16:23:35.659 +08:00] [WARN] [batch.go:57] ["etcd txn ops exceed the limit"] [key=/tidb/cdc/default/default/task/position/0597d9b4-65c9-4815-8c59-98a539ca35f5/canal-json-kafka-test] [value=]
[2022/09/20 16:23:35.659 +08:00] [WARN] [batch.go:57] ["etcd txn ops exceed the limit"] [key=/tidb/cdc/default/default/task/position/c14fad0a-bf8e-4091-b94c-fd9e1db194fd/canal-json-kafka-test] [value=]
[2022/09/20 16:23:35.659 +08:00] [WARN] [batch.go:57] ["etcd txn ops exceed the limit"] [key=/tidb/cdc/default/default/task/position/451e03cc-bb61-4c48-9bbc-9e35c197d2c8/canal-json-kafka-test] [value=]
[2022/09/20 16:23:35.659 +08:00] [WARN] [batch.go:57] ["etcd txn ops exceed the limit"] [key=/tidb/cdc/default/default/task/position/5c78d40b-1217-4e6d-b96b-8a776011d5e1/canal-json-kafka-test] [value=]
[2022/09/20 16:23:35.659 +08:00] [WARN] [batch.go:57] ["etcd txn ops exceed the limit"] [key=/tidb/cdc/default/default/task/position/b6fd703d-edbb-4315-90a6-3eb26d7fa4b3/canal-json-kafka-test] [value=]
[2022/09/20 16:23:35.659 +08:00] [WARN] [batch.go:57] ["etcd txn ops exceed the limit"] [key=/tidb/cdc/default/default/task/position/a824103d-e271-40d1-81ef-5c8902eca794/canal-json-kafka-test] [value=]
[2022/09/20 16:23:35.659 +08:00] [WARN] [batch.go:57] ["etcd txn ops exceed the limit"] [key=/tidb/cdc/default/default/task/position/d0d1b5af-1463-4eb5-8203-27d38c3a2c4e/canal-json-kafka-test] [value=]
[2022/09/20 16:23:35.659 +08:00] [WARN] [batch.go:57] ["etcd txn ops exceed the limit"] [key=/tidb/cdc/default/default/task/position/28a168c2-6ad8-47fd-918f-224616c31a06/canal-json-kafka-test] [value=]
[2022/09/20 16:23:35.659 +08:00] [WARN] [batch.go:57] ["etcd txn ops exceed the limit"] [key=/tidb/cdc/default/default/task/position/b0e2c01f-e52b-42de-9836-6599e085f45a/canal-json-kafka-test] [value=]
[2022/09/20 16:23:35.659 +08:00] [WARN] [batch.go:57] ["etcd txn ops exceed the limit"] [key=/tidb/cdc/default/default/task/position/31cddbd1-a460-4409-b9d1-da5a1afc3989/canal-json-kafka-test] [value=]
[2022/09/20 16:23:35.659 +08:00] [WARN] [batch.go:57] ["etcd txn ops exceed the limit"] [key=/tidb/cdc/default/default/task/position/bff36fdc-2caf-47eb-909b-e170e3c71fd5/canal-json-kafka-test] [value=]
[2022/09/20 16:23:35.659 +08:00] [WARN] [batch.go:57] ["etcd txn ops exceed the limit"] [key=/tidb/cdc/default/default/task/position/5b54df1d-2c7b-4d41-a995-1e934009d3a6/canal-json-kafka-test] [value=]
[2022/09/20 16:23:35.659 +08:00] [WARN] [batch.go:57] ["etcd txn ops exceed the limit"] [key=/tidb/cdc/default/default/task/position/b4511af6-f2a7-4cea-aeb2-80f8bec01940/canal-json-kafka-test] [value=]
[2022/09/20 16:23:35.659 +08:00] [WARN] [batch.go:57] ["etcd txn ops exceed the limit"] [key=/tidb/cdc/default/default/task/position/72e49c50-cf9e-407d-8659-b782bfbb404a/canal-json-kafka-test] [value=]
[2022/09/20 16:23:35.659 +08:00] [WARN] [batch.go:57] ["etcd txn ops exceed the limit"] [key=/tidb/cdc/default/default/task/position/07ca0e7d-0393-446c-8998-08245fabbc70/canal-json-kafka-test] [value=]
[2022/09/20 16:23:35.660 +08:00] [WARN] [batch.go:57] ["etcd txn ops exceed the limit"] [key=/tidb/cdc/default/default/task/position/32130488-3538-4734-9a07-7deb47a4c86e/canal-json-kafka-test] [value=]
[2022/09/20 16:23:35.660 +08:00] [WARN] [batch.go:57] ["etcd txn ops exceed the limit"] [key=/tidb/cdc/default/default/task/position/973c4f90-191d-4522-b8eb-0b3975c505ad/canal-json-kafka-test] [value=]
[2022/09/20 16:23:35.660 +08:00] [WARN] [batch.go:57] ["etcd txn ops exceed the limit"] [key=/tidb/cdc/default/default/task/position/37ab8206-1be9-4090-aef2-dba5d35d3aab/canal-json-kafka-test] [value=]
[2022/09/20 16:23:35.660 +08:00] [WARN] [batch.go:57] ["etcd txn ops exceed the limit"] [key=/tidb/cdc/default/default/task/position/e46f9e36-7f1f-4f84-8394-06edcb2f09e9/canal-json-kafka-test] [value=]
[2022/09/20 16:23:35.660 +08:00] [WARN] [batch.go:57] ["etcd txn ops exceed the limit"] [key=/tidb/cdc/default/default/task/position/65597121-adfb-4b88-8728-60de84e57318/canal-json-kafka-test] [value=]
...
[2022/09/20 16:23:35.723 +08:00] [INFO] [client.go:235] ["WatchWithChan exited"] [role=owner]
[2022/09/20 16:23:35.726 +08:00] [INFO] [capture.go:464] ["owner resigned successfully"] [captureID=97a65870-e9ce-49d1-9306-239eec1bed1a] [ownerRev=5130995]
[2022/09/20 16:23:35.726 +08:00] [WARN] [capture.go:467] ["run owner exited with error"] [captureID=97a65870-e9ce-49d1-9306-239eec1bed1a] [ownerRev=5130995] [error="[CDC:ErrEtcdTxnOpsExceed]patch ops:4002 of a single changefeed exceed etcd txn max ops:128"] [errorVerbose="[CDC:ErrEtcdTxnOpsExceed]patch ops:4002 of a single changefeed exceed etcd txn max ops:128\ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/errors@v0.11.5-0.20211224045212-9687c2b0f87c/errors.go:174\ngithub.com/pingcap/errors.(*Error).GenWithStackByArgs\n\tgithub.com/pingcap/errors@v0.11.5-0.20211224045212-9687c2b0f87c/normalize.go:164\ngithub.com/pingcap/tiflow/pkg/orchestrator.getBatchChangedState\n\tgithub.com/pingcap/tiflow/pkg/orchestrator/batch.go:60\ngithub.com/pingcap/tiflow/pkg/orchestrator.(*EtcdWorker).applyPatchGroups\n\tgithub.com/pingcap/tiflow/pkg/orchestrator/etcd_worker.go:374\ngithub.com/pingcap/tiflow/pkg/orchestrator.(*EtcdWorker).Run\n\tgithub.com/pingcap/tiflow/pkg/orchestrator/etcd_worker.go:234\ngithub.com/pingcap/tiflow/cdc/capture.(*captureImpl).runEtcdWorker\n\tgithub.com/pingcap/tiflow/cdc/capture/capture.go:491\ngithub.com/pingcap/tiflow/cdc/capture.(*captureImpl).campaignOwner\n\tgithub.com/pingcap/tiflow/cdc/capture/capture.go:443\ngithub.com/pingcap/tiflow/cdc/capture.(*captureImpl).run.func3\n\tgithub.com/pingcap/tiflow/cdc/capture/capture.go:313\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/sync@v0.0.0-20220722155255-886fb9371eb4/errgroup/errgroup.go:75\nruntime.goexit\n\truntime/asm_amd64.s:1571"]
[2022/09/20 16:23:35.727 +08:00] [INFO] [capture.go:314] ["owner routine exited"] [captureID=97a65870-e9ce-49d1-9306-239eec1bed1a] [error="[CDC:ErrEtcdTxnOpsExceed]patch ops:4002 of a single changefeed exceed etcd txn max ops:128"] [errorVerbose="[CDC:ErrEtcdTxnOpsExceed]patch ops:4002 of a single changefeed exceed etcd txn max ops:128\ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/errors@v0.11.5-0.20211224045212-9687c2b0f87c/errors.go:174\ngithub.com/pingcap/errors.(*Error).GenWithStackByArgs\n\tgithub.com/pingcap/errors@v0.11.5-0.20211224045212-9687c2b0f87c/normalize.go:164\ngithub.com/pingcap/tiflow/pkg/orchestrator.getBatchChangedState\n\tgithub.com/pingcap/tiflow/pkg/orchestrator/batch.go:60\ngithub.com/pingcap/tiflow/pkg/orchestrator.(*EtcdWorker).applyPatchGroups\n\tgithub.com/pingcap/tiflow/pkg/orchestrator/etcd_worker.go:374\ngithub.com/pingcap/tiflow/pkg/orchestrator.(*EtcdWorker).Run\n\tgithub.com/pingcap/tiflow/pkg/orchestrator/etcd_worker.go:234\ngithub.com/pingcap/tiflow/cdc/capture.(*captureImpl).runEtcdWorker\n\tgithub.com/pingcap/tiflow/cdc/capture/capture.go:491\ngithub.com/pingcap/tiflow/cdc/capture.(*captureImpl).campaignOwner\n\tgithub.com/pingcap/tiflow/cdc/capture/capture.go:443\ngithub.com/pingcap/tiflow/cdc/capture.(*captureImpl).run.func3\n\tgithub.com/pingcap/tiflow/cdc/capture/capture.go:313\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/sync@v0.0.0-20220722155255-886fb9371eb4/errgroup/errgroup.go:75\nruntime.goexit\n\truntime/asm_amd64.s:1571"]
[2022/09/20 16:23:35.727 +08:00] [INFO] [client.go:235] ["WatchWithChan exited"] [role=processor]
[2022/09/20 16:23:35.728 +08:00] [INFO] [processor.go:518] ["processor exited"] [capture=97a65870-e9ce-49d1-9306-239eec1bed1a] [namespace=default] [changefeed=canal-json-kafka-test] [error="[CDC:ErrAdminStopProcessor]stop processor by admin command"] [errorVerbose="[CDC:ErrAdminStopProcessor]stop processor by admin command\ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/errors@v0.11.5-0.20211224045212-9687c2b0f87c/errors.go:174\ngithub.com/pingcap/errors.(*Error).GenWithStackByArgs\n\tgithub.com/pingcap/errors@v0.11.5-0.20211224045212-9687c2b0f87c/normalize.go:164\ngithub.com/pingcap/tiflow/cdc/processor.(*processor).tick\n\tgithub.com/pingcap/tiflow/cdc/processor/processor.go:555\ngithub.com/pingcap/tiflow/cdc/processor.(*processor).Tick\n\tgithub.com/pingcap/tiflow/cdc/processor/processor.go:497\ngithub.com/pingcap/tiflow/cdc/processor.(*managerImpl).Tick\n\tgithub.com/pingcap/tiflow/cdc/processor/manager.go:128\ngithub.com/pingcap/tiflow/pkg/orchestrator.(*EtcdWorker).Run\n\tgithub.com/pingcap/tiflow/pkg/orchestrator/etcd_worker.go:266\ngithub.com/pingcap/tiflow/cdc/capture.(*captureImpl).runEtcdWorker\n\tgithub.com/pingcap/tiflow/cdc/capture/capture.go:491\ngithub.com/pingcap/tiflow/cdc/capture.(*captureImpl).run.func4\n\tgithub.com/pingcap/tiflow/cdc/capture/capture.go:334\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/sync@v0.0.0-20220722155255-886fb9371eb4/errgroup/errgroup.go:75\nruntime.goexit\n\truntime/asm_amd64.s:1571"]
[2022/09/20 16:23:35.728 +08:00] [INFO] [processor.go:1050] ["processor closing ..."] [namespace=default] [changefeed=canal-json-kafka-test]
[2022/09/20 16:23:35.728 +08:00] [INFO] [processor.go:1111] ["processor closed"] [namespace=default] [changefeed=canal-json-kafka-test]
[2022/09/20 16:23:35.728 +08:00] [INFO] [capture.go:335] ["processor routine exited"] [captureID=97a65870-e9ce-49d1-9306-239eec1bed1a] [error="[CDC:ErrPDEtcdAPIError]etcd api call error: context canceled"] [errorVerbose="[CDC:ErrPDEtcdAPIError]etcd api call error: context canceled\ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/errors@v0.11.5-0.20211224045212-9687c2b0f87c/errors.go:174\ngithub.com/pingcap/errors.(*Error).GenWithStackByArgs\n\tgithub.com/pingcap/errors@v0.11.5-0.20211224045212-9687c2b0f87c/normalize.go:164\ngithub.com/pingcap/tiflow/pkg/errors.WrapError\n\tgithub.com/pingcap/tiflow/pkg/errors/helper.go:32\ngithub.com/pingcap/tiflow/cdc/capture.(*captureImpl).runEtcdWorker\n\tgithub.com/pingcap/tiflow/cdc/capture/capture.go:505\ngithub.com/pingcap/tiflow/cdc/capture.(*captureImpl).run.func4\n\tgithub.com/pingcap/tiflow/cdc/capture/capture.go:334\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/sync@v0.0.0-20220722155255-886fb9371eb4/errgroup/errgroup.go:75\nruntime.goexit\n\truntime/asm_amd64.s:1571"]
[2022/09/20 16:23:35.728 +08:00] [INFO] [capture.go:574] ["processor manager closed"] [captureID=97a65870-e9ce-49d1-9306-239eec1bed1a]
[2022/09/20 16:23:35.728 +08:00] [INFO] [capture.go:582] ["message router closed"] [captureID=97a65870-e9ce-49d1-9306-239eec1bed1a]
[2022/09/20 16:23:35.729 +08:00] [INFO] [client.go:771] ["[pd] stop fetching the pending tso requests due to context canceled"] [dc-location=global]
[2022/09/20 16:23:35.729 +08:00] [INFO] [client.go:709] ["[pd] exit tso dispatcher"] [dc-location=global]
[2022/09/20 16:23:35.730 +08:00] [INFO] [capture.go:275] ["the capture routine has exited"]
[2022/09/20 16:23:35.730 +08:00] [INFO] [client.go:771] ["[pd] stop fetching the pending tso requests due to context canceled"] [dc-location=global]
[2022/09/20 16:23:35.730 +08:00] [INFO] [client.go:709] ["[pd] exit tso dispatcher"] [dc-location=global]
[2022/09/20 16:23:35.730 +08:00] [ERROR] [server.go:267] ["http server error"] [error="[CDC:ErrServeHTTP]serve http error: mux: server closed"] [errorVerbose="[CDC:ErrServeHTTP]serve http error: mux: server closed\ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/errors@v0.11.5-0.20211224045212-9687c2b0f87c/errors.go:174\ngithub.com/pingcap/errors.(*Error).GenWithStackByArgs\n\tgithub.com/pingcap/errors@v0.11.5-0.20211224045212-9687c2b0f87c/normalize.go:164\ngithub.com/pingcap/tiflow/pkg/errors.WrapError\n\tgithub.com/pingcap/tiflow/pkg/errors/helper.go:32\ngithub.com/pingcap/tiflow/cdc/server.(*server).startStatusHTTP.func1\n\tgithub.com/pingcap/tiflow/cdc/server/server.go:267\nruntime.goexit\n\truntime/asm_amd64.s:1571"]
[2022/09/20 16:23:35.730 +08:00] [WARN] [server.go:159] ["cdc server exits with error"] [error="[CDC:ErrEtcdTxnOpsExceed]patch ops:4002 of a single changefeed exceed etcd txn max ops:128"] [errorVerbose="[CDC:ErrEtcdTxnOpsExceed]patch ops:4002 of a single changefeed exceed etcd txn max ops:128\ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/errors@v0.11.5-0.20211224045212-9687c2b0f87c/errors.go:174\ngithub.com/pingcap/errors.(*Error).GenWithStackByArgs\n\tgithub.com/pingcap/errors@v0.11.5-0.20211224045212-9687c2b0f87c/normalize.go:164\ngithub.com/pingcap/tiflow/pkg/orchestrator.getBatchChangedState\n\tgithub.com/pingcap/tiflow/pkg/orchestrator/batch.go:60\ngithub.com/pingcap/tiflow/pkg/orchestrator.(*EtcdWorker).applyPatchGroups\n\tgithub.com/pingcap/tiflow/pkg/orchestrator/etcd_worker.go:374\ngithub.com/pingcap/tiflow/pkg/orchestrator.(*EtcdWorker).Run\n\tgithub.com/pingcap/tiflow/pkg/orchestrator/etcd_worker.go:234\ngithub.com/pingcap/tiflow/cdc/capture.(*captureImpl).runEtcdWorker\n\tgithub.com/pingcap/tiflow/cdc/capture/capture.go:491\ngithub.com/pingcap/tiflow/cdc/capture.(*captureImpl).campaignOwner\n\tgithub.com/pingcap/tiflow/cdc/capture/capture.go:443\ngithub.com/pingcap/tiflow/cdc/capture.(*captureImpl).run.func3\n\tgithub.com/pingcap/tiflow/cdc/capture/capture.go:313\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/sync@v0.0.0-20220722155255-886fb9371eb4/errgroup/errgroup.go:75\nruntime.goexit\n\truntime/asm_amd64.s:1571"]
[2022/09/20 16:23:35.731 +08:00] [INFO] [server.go:395] ["table actor system closed"] [duration=278.121µs]
[2022/09/20 16:23:35.731 +08:00] [INFO] [compactor.go:142] ["compactor actor quit"] [ID=6]
[2022/09/20 16:23:35.731 +08:00] [INFO] [compactor.go:142] ["compactor actor quit"] [ID=1]
[2022/09/20 16:23:35.731 +08:00] [INFO] [compactor.go:142] ["compactor actor quit"] [ID=0]
[2022/09/20 16:23:35.731 +08:00] [INFO] [compactor.go:142] ["compactor actor quit"] [ID=7]
[2022/09/20 16:23:35.731 +08:00] [INFO] [compactor.go:142] ["compactor actor quit"] [ID=3]
[2022/09/20 16:23:35.731 +08:00] [INFO] [compactor.go:142] ["compactor actor quit"] [ID=4]
[2022/09/20 16:23:35.731 +08:00] [INFO] [compactor.go:142] ["compactor actor quit"] [ID=5]
[2022/09/20 16:23:35.731 +08:00] [INFO] [compactor.go:142] ["compactor actor quit"] [ID=2]
[2022/09/20 16:23:35.732 +08:00] [INFO] [db.go:265] ["db actor quit"] [ID=0]
[2022/09/20 16:23:35.732 +08:00] [INFO] [db.go:265] ["db actor quit"] [ID=4]
[2022/09/20 16:23:35.732 +08:00] [INFO] [db.go:265] ["db actor quit"] [ID=3]
[2022/09/20 16:23:35.732 +08:00] [INFO] [db.go:265] ["db actor quit"] [ID=7]
[2022/09/20 16:23:35.732 +08:00] [INFO] [db.go:265] ["db actor quit"] [ID=1]
[2022/09/20 16:23:35.732 +08:00] [INFO] [db.go:265] ["db actor quit"] [ID=6]
[2022/09/20 16:23:35.732 +08:00] [INFO] [db.go:265] ["db actor quit"] [ID=2]
[2022/09/20 16:23:35.732 +08:00] [INFO] [db.go:265] ["db actor quit"] [ID=5]
[2022/09/20 16:23:35.735 +08:00] [INFO] [server.go:402] ["sorter actor system closed"] [duration=3.505642ms]
[2022/09/20 16:23:35.735 +08:00] [INFO] [capture.go:574] ["processor manager closed"] [captureID=97a65870-e9ce-49d1-9306-239eec1bed1a]
[2022/09/20 16:23:35.735 +08:00] [INFO] [capture.go:582] ["message router closed"] [captureID=97a65870-e9ce-49d1-9306-239eec1bed1a]

Versions of the cluster

Upstream TiDB cluster version (execute SELECT tidb_version(); in a MySQL client):

(paste TiDB cluster version here)

Upstream TiKV version (execute tikv-server --version):

(paste TiKV version here)

TiCDC version (execute cdc version):

master
@3AceShowHand
Copy link
Contributor Author

This should not happen frequently, so make it as moderate.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
affects-6.1 area/ticdc Issues or PRs related to TiCDC. severity/moderate type/bug The issue is confirmed as a bug.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants