
TiKV initial scan task not cancelled/terminated when changefeed paused or failed #11638

Open
fubinzh opened this issue Oct 9, 2024 · 9 comments
Labels
affects-6.5 affects-7.1 area/ticdc Issues or PRs related to TiCDC. severity/major type/bug The issue is confirmed as a bug.

Comments

@fubinzh

fubinzh commented Oct 9, 2024

What did you do?

  1. 07:51 - 07:55 Create 4 CDC clusters (each with 32C CDC nodes); for each CDC cluster create one changefeed, each changefeed replicating ~40k tables (~160k tables overall)
  2. 07:58 Pause the changefeeds and update their config
  3. 08:12 Resume all changefeeds
  4. 09:38 Pause all changefeeds
  5. 10:33 Delete the changefeeds (from the TiDB Cloud UI)
  6. 10:07 - 10:33 Manually kill and restart all TiKV nodes

What did you expect to see?

  • When a changefeed is paused, its pending initial scan tasks should be cancelled; otherwise they might block subsequent changefeed initialization.
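
A minimal sketch of the expected behaviour, assuming pending initial scan tasks are queued per changefeed and tied to a changefeed-scoped context (all names here are hypothetical, not TiCDC's actual API): pausing cancels the context, so queued tasks that have not started yet are dropped instead of lingering and blocking later initialization.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// scanTask models a pending initial scan for one region (hypothetical type).
type scanTask struct{ regionID uint64 }

// runInitialScans drains queued scan tasks until the changefeed context is
// cancelled; tasks that have not started yet are dropped instead of executed.
func runInitialScans(ctx context.Context, tasks <-chan scanTask) {
	for {
		select {
		case <-ctx.Done():
			fmt.Println("changefeed paused: pending initial scans cancelled")
			return
		case t, ok := <-tasks:
			if !ok {
				return
			}
			// Simulate one initial scan; a real scan would also watch ctx.
			fmt.Printf("scanning region %d\n", t.regionID)
			time.Sleep(10 * time.Millisecond)
		}
	}
}

func main() {
	ctx, pause := context.WithCancel(context.Background())
	tasks := make(chan scanTask, 1024)
	for i := uint64(1); i <= 1000; i++ {
		tasks <- scanTask{regionID: i}
	}
	go runInitialScans(ctx, tasks)

	// Pausing the changefeed cancels the context; the remaining queue is abandoned.
	time.Sleep(50 * time.Millisecond)
	pause()
	time.Sleep(20 * time.Millisecond)
}
```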

What did you see instead?

After step 1, one changefeed finished initialization (all tables were added to the changefeed).
After step 3, the table count of the previously initialized changefeed stayed at 0 for more than 1.4 hours, until it was paused.
After step 4, the number of pending initial scan tasks decreased, but very slowly.
After step 6, the pending tasks disappeared.


Versions of the cluster

v6.5.9

@fubinzh fubinzh added area/ticdc Issues or PRs related to TiCDC. type/bug The issue is confirmed as a bug. labels Oct 9, 2024
@fubinzh
Author

fubinzh commented Oct 9, 2024

Encountered a similar issue when a changefeed failed.

The changefeed failed at 10/03 00:00 during the initial scan, but the initial scan continued (and the sorter disk usage actually kept increasing) until I deleted all changefeeds at 10/03 09:00.

@fubinzh
Author

fubinzh commented Oct 9, 2024

/severity major

@3AceShowHand
Contributor

Pausing and removing a changefeed closes the kv client; this should also close and cancel the gRPC connections to TiKV-CDC.
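
A minimal sketch of that teardown path, with illustrative names rather than TiCDC's real types: the kv client owns one root context, every per-region event feed derives from it, and Close() cancels the context and waits for the feeds (and hence their gRPC streams) to drain.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// kvClient is a toy model of the per-changefeed kv client (illustrative names).
// Every per-region event feed derives from rootCtx, so cancelling it is what
// ultimately tears down the gRPC streams to TiKV-CDC.
type kvClient struct {
	rootCtx context.Context
	cancel  context.CancelFunc
	wg      sync.WaitGroup
}

func newKVClient() *kvClient {
	ctx, cancel := context.WithCancel(context.Background())
	return &kvClient{rootCtx: ctx, cancel: cancel}
}

// startRegionFeed stands in for opening one region's event-feed stream.
func (c *kvClient) startRegionFeed(regionID uint64) {
	c.wg.Add(1)
	go func() {
		defer c.wg.Done()
		// A real feed would loop on stream.Recv() until the ctx/stream is cancelled.
		<-c.rootCtx.Done()
		fmt.Printf("region %d feed cancelled\n", regionID)
	}()
}

// Close cancels every region feed and waits for them to drain.
func (c *kvClient) Close() {
	c.cancel()
	c.wg.Wait()
}

func main() {
	client := newKVClient()
	for r := uint64(1); r <= 5; r++ {
		client.startRegionFeed(r)
	}
	time.Sleep(50 * time.Millisecond)

	start := time.Now()
	client.Close() // pausing/removing the changefeed ends up here
	fmt.Println("kv client fully closed in", time.Since(start))
}
```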

@3AceShowHand
Contributor

I am trying to reproduce this scenario in my local test environment. Here are some observations:

  • There are more than 100,000 regions in the TiKV cluster.

On the TiCDC side:

  • Resumed the changefeed at around 22:03:07.265: the kv client connects to TiKV and starts incremental scan tasks.
  • Paused the changefeed at around 22:04:38.215. The kv client received a lot of "region failed" errors, the last one at 22:04:41.591, which means it took around 3 seconds for the kv client to fully close. That is 14954 errors handled in 3 seconds, around 5000/s.

On the TiKV side:

  • cdc register region appears in the logs: a lot of regions are registered.
  • The TiKV-CDC incremental scan worker thread count defaults to 4, and the incremental scan concurrency is 6. For each region, the scan task is blocked on acquiring the concurrency semaphore (modelled in the sketch at the end of this comment).
  • When the changefeed is paused, TiKV-CDC deregisters the connection and stops observing regions, which also costs around 3 seconds.
    • Stopping region observation also sets the downstream state to stopped, so the incremental scan can be cancelled.
[2024/10/12 22:03:07.265 +08:00] [INFO] [middleware.go:48] [/api/v2/changefeeds/blackhole-test/resume] [status=200] [method=POST] [path=/api/v2/changefeeds/blackhole-test/resume] [query=] [ip=10.2.7.26] [user-agent=Go-http-client/1.1] [client-version=v6.5.9] [] [duration=146.190585ms]

[2024/10/12 22:04:38.215 +08:00] [INFO] [middleware.go:48] [/api/v1/changefeeds/blackhole-test/pause] [status=202] [method=POST] [path=/api/v1/changefeeds/blackhole-test/pause] [query=] [ip=10.2.7.26] [user-agent=Go-http-client/1.1] [client-version=v6.5.9] [] [duration=78.699473ms]

[2024/10/12 22:04:41.591 +08:00] [INFO] [client.go:613] ["region failed"]
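
For illustration, here is a small Go model of the TiKV-side flow described above (TiKV itself is Rust, and all names here are made up): incremental scans queue behind a small concurrency semaphore, and deregistering the downstream flips a stop flag so that queued scans bail out instead of running.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

func main() {
	const scanConcurrency = 6 // mirrors the default incremental scan concurrency mentioned above

	sem := make(chan struct{}, scanConcurrency) // concurrency semaphore
	var downstreamStopped atomic.Bool
	var wg sync.WaitGroup

	scanRegion := func(regionID int) {
		defer wg.Done()
		sem <- struct{}{} // most regions block here while the semaphore is saturated
		defer func() { <-sem }()

		if downstreamStopped.Load() {
			// Downstream was deregistered while we were queued: cancel the scan.
			return
		}
		time.Sleep(5 * time.Millisecond) // pretend to do the incremental scan
		fmt.Printf("region %d scanned\n", regionID)
	}

	for r := 1; r <= 1000; r++ {
		wg.Add(1)
		go scanRegion(r)
	}

	// Pausing the changefeed deregisters the downstream shortly afterwards.
	time.Sleep(20 * time.Millisecond)
	downstreamStopped.Store(true)

	wg.Wait()
	fmt.Println("all queued scans either finished or were cancelled")
}
```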

@3AceShowHand
Contributor

  • The changefeed was paused around 09:40.
  • The event feed count did not drop to 0 immediately.
  • The KV client still received a lot of kv entries.

@3AceShowHand
Contributor

3AceShowHand commented Oct 16, 2024

[2024/10/02 01:38:47.996 +00:00] [INFO] [feed_state_manager.go:243] ["handle admin job"] [namespace=default] [changefeed=changefeed-1841264655229501440] [job="{\"CfID\":{\"Namespace\":\"default\",\"ID\":\"changefeed-1841264655229501440\"},\"Type\":1,\"Error\":null,\"OverwriteCheckpointTs\":0}"]


[2024/10/02 01:38:47.996 +00:00] [INFO] [ddl_puller.go:692] ["close the ddl puller"] [namespace=default] [changefeed=changefeed-1841264655229501440_ddl_puller]

[2024/10/02 01:38:48.091 +00:00] [INFO] [changefeed.go:723] ["changefeed closed"] [namespace=default] [changefeed=changefeed-1841264655229501440]

[2024/10/02 01:38:48.098 +00:00] [INFO] [manager.go:132] ["Closing source manager"] [namespace=default] [changefeed=changefeed-1841264655229501440]

[2024/10/02 01:38:48.098 +00:00] [INFO] [manager.go:140] ["All pullers have been closed"] [namespace=default] [changefeed=changefeed-1841264655229501440] [cost=181ns]

[2024/10/02 01:38:48.098 +00:00] [INFO] [manager.go:147] ["Closed source manager"] [namespace=default] [changefeed=changefeed-1841264655229501440] [cost=44.248µs]

From the logs above, we know that both the changefeed and the processor were closed.

But there are still logs indicating that the kv client keeps fetching data: a region reports an error, and then the client reconnects to that region.

[2024/10/02 01:38:48.099 +00:00] [INFO] [client.go:613] ["region failed"] [span="[748000fffffffffffffe5f728000000000ff1f11290000000000fa, 748000fffffffffffffe5f728000000000ff1f1a570000000000fa)"] [regionId=1504] [error="[CDC:ErrPendingRegionCancel]pending region cancelled due to stream disconnecting"]
[2024/10/02 01:38:48.146 +00:00] [INFO] [client.go:1462] ["event feed starts to check locked regions"] [namespace=default] [changefeed=changefeed-1841264655229501440] [tableID=67502] 
...

The following logs indicate that the processor was blocked for a long time; this may be the root cause of the issue (see the sketch after the logs below).


[2024/10/01 23:59:09.085 +00:00] [WARN] [client.go:271] ["etcd client outCh blocking too long, the etcdWorker may be stuck"] [duration=10.00036425s] [role=processor]

[2024/10/02 01:38:48.093 +00:00] [WARN] [client.go:271] ["etcd client outCh blocking too long, the etcdWorker may be stuck"] [duration=1h39m44.999948235s] [role=processor]
[2024/10/02 01:38:48.194 +00:00] [INFO] [puller.go:293] ["table puller has been stucked"] [namespace=default] [changefeed=changefeed-1841264655229501440] [tableID=82046] [tableName=`db_218`.`dh_pay_customer_0`] [resolvedTs=452818737561600000] [slowestRangeTs=452818737561600000] [range="[7480000000000140ff7e5f72ffffffffffffffffff0000000000fb, 7480000000000140ff7e5f730000000000fa)"]

[2024/10/02 00:00:58.937 +00:00] [INFO] [client.go:1485] ["event feed finds a uninitialized slow region"] [namespace=default] [changefeed=changefeed-1841264655229501440] [tableID=22030] [tableName=`db_134`.`dh_user_withdraw_order_0`] [slowRegion="{\"RegionID\":2176379,\"CheckpointTs\":452818737561600000,\"Initialized\":false,\"Created\":\"2024-10-01T23:54:58.945274175Z\"}"]
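
For context, a minimal sketch of the kind of watchdog that emits warnings like the "outCh blocking too long" lines above, assuming the send on the output channel is wrapped in a select loop with a timer; this is an illustration of the pattern, not TiCDC's actual etcd client code.

```go
package main

import (
	"fmt"
	"time"
)

// sendWithWatchdog wraps a channel send in a select loop: every time the
// warning interval elapses without the consumer taking the value, the
// accumulated blocking duration is logged.
func sendWithWatchdog(outCh chan<- string, value string, warnAfter time.Duration) {
	start := time.Now()
	ticker := time.NewTicker(warnAfter)
	defer ticker.Stop()
	for {
		select {
		case outCh <- value:
			return
		case <-ticker.C:
			fmt.Printf("outCh blocking too long, the worker may be stuck, duration=%v\n",
				time.Since(start))
		}
	}
}

func main() {
	outCh := make(chan string) // unbuffered: blocks until the consumer reads

	go func() {
		// Simulate a stuck consumer (e.g. a processor blocked on closing pullers).
		time.Sleep(250 * time.Millisecond)
		fmt.Println("consumer finally read:", <-outCh)
	}()

	sendWithWatchdog(outCh, "etcd event", 100*time.Millisecond)
	time.Sleep(50 * time.Millisecond) // give the consumer goroutine time to print
}
```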

@3AceShowHand
Contributor

3AceShowHand commented Oct 16, 2024

[2024/10/01 23:58:50.650 +00:00] [INFO] [processor.go:1242] ["Processor try to close source manager"] [namespace=default] [changefeed=changefeed-1841264655229501440]
[2024/10/01 23:58:50.650 +00:00] [INFO] [manager.go:132] ["Closing source manager"] [namespace=default] [changefeed=changefeed-1841264655229501440]
...
[2024/10/02 00:20:13.166 +00:00] [INFO] [manager.go:140] ["All pullers have been closed"] [namespace=default] [changefeed=changefeed-1841264655229501440] [cost=21m22.516489165s]

One log shows that closing the pullers took more than 21 minutes. Another processor does not even have such a log and was blocked for more than 2.5 hours.

[2024/10/02 02:32:39.092 +00:00] [WARN] [client.go:271] ["etcd client outCh blocking too long, the etcdWorker may be stuck"] [duration=2h33m35.999487376s] [role=processor]

@3AceShowHand
Contributor

3AceShowHand commented Oct 16, 2024

The root cause of the issue is that the kv client cannot unregister its event handler from the worker pool immediately.

When a puller is closed, its kv client is closed as well, and the kv client has to unregister its event handler from the worker pool first. This takes too much time, so each puller cannot be closed promptly. When there are a lot of tables, each table has its own puller, so some pullers are not yet closed and keep working as usual.
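
A toy model of that cost, with made-up names rather than the real workerpool API: each puller's Close() must wait for the pool's synchronize barrier, and pullers are closed one after another, so the total close time grows linearly with the number of tables.

```go
package main

import (
	"fmt"
	"time"
)

// workerPool models a shared event-handling pool whose unregister path must
// wait for an internal synchronize barrier before it can return.
type workerPool struct{ syncCost time.Duration }

// unregister blocks until the pool has synchronized, i.e. until no worker can
// still be running the handler being removed.
func (p *workerPool) unregister(handler string) {
	time.Sleep(p.syncCost)
}

type puller struct {
	table string
	pool  *workerPool
}

// Close must unregister the kv client's event handler before returning.
func (p *puller) Close() {
	p.pool.unregister(p.table)
}

func main() {
	pool := &workerPool{syncCost: 50 * time.Millisecond} // pretend synchronize is slow

	pullers := make([]*puller, 0, 20)
	for i := 0; i < 20; i++ {
		pullers = append(pullers, &puller{table: fmt.Sprintf("t%d", i), pool: pool})
	}

	// Closing pullers sequentially: total cost ≈ tables × synchronize cost,
	// which is why closing ~40k pullers can take tens of minutes.
	start := time.Now()
	for _, p := range pullers {
		p.Close()
	}
	fmt.Printf("closed %d pullers in %v\n", len(pullers), time.Since(start))
}
```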

This issue should not exist after v7.5, since the kv client was refactored and the workerpool is no longer used.

[2024/10/02 02:04:51.764 +00:00] [WARN] [pool_impl.go:444] ["synchronize is taking too long, report a bug"] [duration=10.000434501s]

@wk989898
Collaborator

This issue should not exist after v7.5, since the kv client was refactored and the workerpool is no longer used.

The workerpool is still used in v7.5:

tiflow/cdc/server/server.go

Lines 350 to 352 in d2c35e4

eg.Go(func() error {
	return kv.RunWorkerPool(egCtx)
})
