Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Owner exits unexpectedly when PD leader is killed -9 #4248

Closed
Tracked by #3844
overvenus opened this issue Jan 7, 2022 · 2 comments · Fixed by #4474
Closed
Tracked by #3844

Owner exits unexpectedly when PD leader is killed -9 #4248

overvenus opened this issue Jan 7, 2022 · 2 comments · Fixed by #4474
Assignees
Labels
affects-5.4 area/ticdc Issues or PRs related to TiCDC. severity/major type/bug The issue is confirmed as a bug.

Comments

@overvenus
Copy link
Member

What did you do?

  1. Running a changefeed.
  2. Kill PD leader (also etcd leader too).

Owner node exits unexpectedly.

image

[2022/01/03 22:59:44.154 +08:00] [WARN] [etcd_worker.go:405] ["Etcd transaction took too long"] [duration=11.000997285s]
[2022/01/03 22:59:44.154 +08:00] [INFO] [client.go:189] ["WatchWithChan exited"]
[2022/01/03 22:59:44.154 +08:00] [INFO] [capture.go:430] ["run owner exited"] [error="etcdserver: request timed out, possibly due to previous leader failure"] [errorVerbose="etcdserver: request timed out, possibly due to previous leader failure\ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/errors@v0.11.5-0.20211009033009-93128226aaa3/errors.go:174\ngithub.com/pingcap/errors.Trace\n\tgithub.com/pingcap/errors@v0.11.5-0.20211009033009-93128226aaa3/juju_adaptor.go:15\ngithub.com/pingcap/tiflow/pkg/orchestrator.(*EtcdWorker).commitChangedState\n\tgithub.com/pingcap/tiflow/pkg/orchestrator/etcd_worker.go:409\ngithub.com/pingcap/tiflow/pkg/orchestrator.(*EtcdWorker).applyPatchGroups\n\tgithub.com/pingcap/tiflow/pkg/orchestrator/etcd_worker.go:337\ngithub.com/pingcap/tiflow/pkg/orchestrator.(*EtcdWorker).Run\n\tgithub.com/pingcap/tiflow/pkg/orchestrator/etcd_worker.go:207\ngithub.com/pingcap/tiflow/cdc/capture.(*Capture).runEtcdWorker\n\tgithub.com/pingcap/tiflow/cdc/capture/capture.go:450\ngithub.com/pingcap/tiflow/cdc/capture.(*Capture).campaignOwner\n\tgithub.com/pingcap/tiflow/cdc/capture/capture.go:428\ngithub.com/pingcap/tiflow/cdc/capture.(*Capture).run.func2\n\tgithub.com/pingcap/tiflow/cdc/capture/capture.go:292\nruntime.goexit\n\truntime/asm_amd64.s:1581"]
[2022/01/03 22:59:44.156 +08:00] [INFO] [capture.go:293] ["the owner routine has exited"] [error="etcdserver: request timed out, possibly due to previous leader failure"] [errorVerbose="etcdserver: request timed out, possibly due to previous leader failure\ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/errors@v0.11.5-0.20211009033009-93128226aaa3/errors.go:174\ngithub.com/pingcap/errors.Trace\n\tgithub.com/pingcap/errors@v0.11.5-0.20211009033009-93128226aaa3/juju_adaptor.go:15\ngithub.com/pingcap/tiflow/pkg/orchestrator.(*EtcdWorker).commitChangedState\n\tgithub.com/pingcap/tiflow/pkg/orchestrator/etcd_worker.go:409\ngithub.com/pingcap/tiflow/pkg/orchestrator.(*EtcdWorker).applyPatchGroups\n\tgithub.com/pingcap/tiflow/pkg/orchestrator/etcd_worker.go:337\ngithub.com/pingcap/tiflow/pkg/orchestrator.(*EtcdWorker).Run\n\tgithub.com/pingcap/tiflow/pkg/orchestrator/etcd_worker.go:207\ngithub.com/pingcap/tiflow/cdc/capture.(*Capture).runEtcdWorker\n\tgithub.com/pingcap/tiflow/cdc/capture/capture.go:450\ngithub.com/pingcap/tiflow/cdc/capture.(*Capture).campaignOwner\n\tgithub.com/pingcap/tiflow/cdc/capture/capture.go:428\ngithub.com/pingcap/tiflow/cdc/capture.(*Capture).run.func2\n\tgithub.com/pingcap/tiflow/cdc/capture/capture.go:292\nruntime.goexit\n\truntime/asm_amd64.s:1581"]
…
[2022/01/03 22:59:59.836 +08:00] [INFO] [capture.go:317] ["the processor routine has exited"] [error="[CDC:ErrPDEtcdAPIError]etcd api call error: context canceled"] [errorVerbose="[CDC:ErrPDEtcdAPIError]etcd api call error: context canceled\ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/errors@v0.11.5-0.20211009033009-93128226aaa3/errors.go:174\ngithub.com/pingcap/errors.(*Error).GenWithStackByArgs\n\tgithub.com/pingcap/errors@v0.11.5-0.20211009033009-93128226aaa3/normalize.go:159\ngithub.com/pingcap/tiflow/pkg/errors.WrapError\n\tgithub.com/pingcap/tiflow/pkg/errors/helper.go:30\ngithub.com/pingcap/tiflow/cdc/capture.(*Capture).runEtcdWorker\n\tgithub.com/pingcap/tiflow/cdc/capture/capture.go:464\ngithub.com/pingcap/tiflow/cdc/capture.(*Capture).run.func3\n\tgithub.com/pingcap/tiflow/cdc/capture/capture.go:316\nruntime.goexit\n\truntime/asm_amd64.s:1581"]

What did you expect to see?

A TiCDC node should never exit unexpectedly.

What did you see instead?

A TiCDC node exits unexpectedly.

Versions of the cluster

Upstream TiDB cluster version (execute SELECT tidb_version(); in a MySQL client):

release-5.4

TiCDC version (execute cdc version):

e274ab691d3b1bbd21e4d0e52e512003762b1767
@overvenus overvenus added type/bug The issue is confirmed as a bug. severity/moderate area/ticdc Issues or PRs related to TiCDC. labels Jan 7, 2022
@overvenus overvenus self-assigned this Jan 7, 2022
overvenus pushed a commit to ti-chi-bot/tiflow that referenced this issue Feb 21, 2022
overvenus pushed a commit to ti-chi-bot/tiflow that referenced this issue Feb 21, 2022
ti-chi-bot pushed a commit that referenced this issue Feb 21, 2022
…) (#4474)

close #4248

Signed-off-by: Neil Shen <overvenus@gmail.com>
@Tammyxia
Copy link

Tammyxia commented Mar 2, 2022

DBaaS error message:
[2022/03/02 03:15:58.471 +00:00] [WARN] [client.go:86] ["etcd RPC failed"] [RPC=Put] [error="etcdserver: requested lease not found"]
[2022/03/02 03:15:58.471 +00:00] [INFO] [capture.go:254] ["the capture routine has exited"]
[2022/03/02 03:15:58.471 +00:00] [ERROR] [http_status.go:74] ["http server error"] [error="[CDC:ErrServeHTTP]serve http error: mux: server closed"] [errorVerbose="[CDC:ErrServeHTTP]serve http error: mux: server closed\ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/errors@v0.11.5-0.20211224045212-9687c2b0f87c/errors.go:174\ngithub.com/pingcap/errors.(*Error).GenWithStackByArgs\n\tgithub.com/pingcap/errors@v0.11.5-0.20211224045212-9687c2b0f87c/normalize.go:164\ngithub.com/pingcap/tiflow/pkg/errors.WrapError\n\tgithub.com/pingcap/tiflow/pkg/errors/helper.go:30\ngithub.com/pingcap/tiflow/cdc.(*Server).startStatusHTTP.func1\n\tgithub.com/pingcap/tiflow/cdc/http_status.go:74\nruntime.goexit\n\truntime/asm_amd64.s:1371"]
[2022/03/02 03:15:58.471 +00:00] [ERROR] [client.go:752] ["[pd] fetch pending tso requests error"] [dc-location=global] [error="[PD:client:ErrClientGetTSO]context canceled: context canceled"]
[2022/03/02 03:15:58.471 +00:00] [INFO] [client.go:666] ["[pd] exit tso dispatcher"] [dc-location=global]
[2022/03/02 03:15:58.471 +00:00] [ERROR] [server.go:131] ["run server"] [error="[CDC:ErrCaptureRegister]capture register to etcd failed: [CDC:ErrPDEtcdAPIError]etcd api call error: [CDC:ErrReachMaxTry]reach maximum try: 8: etcdserver: requested lease not found\ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/errors@v0.11.5-0.20211224045212-9687c2b0f87c/errors.go:174\ngithub.com/pingcap/errors.(*Error).GenWithStackByArgs\n\tgithub.com/pingcap/errors@v0.11.5-0.20211224045212-9687c2b0f87c/normalize.go:164\ngithub.com/pingcap/tiflow/pkg/errors.WrapError\n\tgithub.com/pingcap/tiflow/pkg/errors/helper.go:30\ngithub.com/pingcap/tiflow/cdc/capture.(*Capture).register\n\tgithub.com/pingcap/tiflow/cdc/capture/capture.go:511\ngithub.com/pingcap/tiflow/cdc/capture.(*Capture).run\n\tgithub.com/pingcap/tiflow/cdc/capture/capture.go:272\ngithub.com/pingcap/tiflow/cdc/capture.(*Capture).Run\n\tgithub.com/pingcap/tiflow/cdc/capture/capture.go:244\ngithub.com/pingcap/tiflow/cdc.(*Server).run.func1\n\tgithub.com/pingcap/tiflow/cdc/server.go:258\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/sync@v0.0.0-20210220032951-036812b2e83c/errgroup/errgroup.go:57\nruntime.goexit\n\truntime/asm_amd64.s:1371"]
[2022/03/02 03:15:58.472 +00:00] [ERROR] [client.go:752] ["[pd] fetch pending tso requests error"] [dc-location=global] [error="[PD:client:ErrClientGetTSO]context canceled: context canceled"]
[2022/03/02 03:15:58.472 +00:00] [INFO] [client.go:666] ["[pd] exit tso dispatcher"] [dc-location=global]
Error: run server: [CDC:ErrCaptureRegister]capture register to etcd failed: [CDC:ErrPDEtcdAPIError]etcd api call error: [CDC:ErrReachMaxTry]reach maximum try: 8: etcdserver: requested lease not found

@CharlesCheung96
Copy link
Contributor

DBaaS error message: [2022/03/02 03:15:58.471 +00:00] [WARN] [client.go:86] ["etcd RPC failed"] [RPC=Put] [error="etcdserver: requested lease not found"] [2022/03/02 03:15:58.471 +00:00] [INFO] [capture.go:254] ["the capture routine has exited"] [2022/03/02 03:15:58.471 +00:00] [ERROR] [http_status.go:74] ["http server error"] [error="[CDC:ErrServeHTTP]serve http error: mux: server closed"] [errorVerbose="[CDC:ErrServeHTTP]serve http error: mux: server closed\ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/errors@v0.11.5-0.20211224045212-9687c2b0f87c/errors.go:174\ngithub.com/pingcap/errors.(*Error).GenWithStackByArgs\n\tgithub.com/pingcap/errors@v0.11.5-0.20211224045212-9687c2b0f87c/normalize.go:164\ngithub.com/pingcap/tiflow/pkg/errors.WrapError\n\tgithub.com/pingcap/tiflow/pkg/errors/helper.go:30\ngithub.com/pingcap/tiflow/cdc.(*Server).startStatusHTTP.func1\n\tgithub.com/pingcap/tiflow/cdc/http_status.go:74\nruntime.goexit\n\truntime/asm_amd64.s:1371"] [2022/03/02 03:15:58.471 +00:00] [ERROR] [client.go:752] ["[pd] fetch pending tso requests error"] [dc-location=global] [error="[PD:client:ErrClientGetTSO]context canceled: context canceled"] [2022/03/02 03:15:58.471 +00:00] [INFO] [client.go:666] ["[pd] exit tso dispatcher"] [dc-location=global] [2022/03/02 03:15:58.471 +00:00] [ERROR] [server.go:131] ["run server"] [error="[CDC:ErrCaptureRegister]capture register to etcd failed: [CDC:ErrPDEtcdAPIError]etcd api call error: [CDC:ErrReachMaxTry]reach maximum try: 8: etcdserver: requested lease not found\ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/errors@v0.11.5-0.20211224045212-9687c2b0f87c/errors.go:174\ngithub.com/pingcap/errors.(*Error).GenWithStackByArgs\n\tgithub.com/pingcap/errors@v0.11.5-0.20211224045212-9687c2b0f87c/normalize.go:164\ngithub.com/pingcap/tiflow/pkg/errors.WrapError\n\tgithub.com/pingcap/tiflow/pkg/errors/helper.go:30\ngithub.com/pingcap/tiflow/cdc/capture.(*Capture).register\n\tgithub.com/pingcap/tiflow/cdc/capture/capture.go:511\ngithub.com/pingcap/tiflow/cdc/capture.(*Capture).run\n\tgithub.com/pingcap/tiflow/cdc/capture/capture.go:272\ngithub.com/pingcap/tiflow/cdc/capture.(*Capture).Run\n\tgithub.com/pingcap/tiflow/cdc/capture/capture.go:244\ngithub.com/pingcap/tiflow/cdc.(*Server).run.func1\n\tgithub.com/pingcap/tiflow/cdc/server.go:258\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/sync@v0.0.0-20210220032951-036812b2e83c/errgroup/errgroup.go:57\nruntime.goexit\n\truntime/asm_amd64.s:1371"] [2022/03/02 03:15:58.472 +00:00] [ERROR] [client.go:752] ["[pd] fetch pending tso requests error"] [dc-location=global] [error="[PD:client:ErrClientGetTSO]context canceled: context canceled"] [2022/03/02 03:15:58.472 +00:00] [INFO] [client.go:666] ["[pd] exit tso dispatcher"] [dc-location=global] Error: run server: [CDC:ErrCaptureRegister]capture register to etcd failed: [CDC:ErrPDEtcdAPIError]etcd api call error: [CDC:ErrReachMaxTry]reach maximum try: 8: etcdserver: requested lease not found

It seems that this problem occurs during the capture creation or restart process, can you provide a more detailed log?

CharlesCheung96 added a commit that referenced this issue Apr 28, 2022
etcd/client(ticdc): add retry operation for etcd transaction api (#4248) (#4474)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
affects-5.4 area/ticdc Issues or PRs related to TiCDC. severity/major type/bug The issue is confirmed as a bug.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants