Make TiDB server shutdown gracefully when PD is dead #18336

tiancaiamao · 2020-07-02T16:02:59Z

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

Run a cluster, kill pd, then kill tidb-server (Ctrl - C)

2. What did you expect to see? (Required)

tidb-server exit

3. What did you see instead (Required)

The process print a log of error log and never exit.

kill -USR1 pid to get the goroutine stack:

goroutine 1 [semacquire, 17 minutes]:
sync.runtime_Semacquire(0xc0004798b0)
        /media/genius/OS/project/go/src/runtime/sema.go:56 +0x42
sync.(*WaitGroup).Wait(0xc0004798a8)
        /media/genius/OS/project/go/src/sync/waitgroup.go:130 +0x64
github.com/pingcap/tidb/owner.(*ownerManager).Cancel(0xc000479830)
        /media/genius/OS/project/src/github.com/pingcap/tidb/owner/manager.go:121 +0x3d
github.com/pingcap/tidb/ddl.(*ddl).close(0xc0001b7f20)
        /media/genius/OS/project/src/github.com/pingcap/tidb/ddl/ddl.go:365 +0xbc
github.com/pingcap/tidb/ddl.(*ddl).Stop(0xc0001b7f20, 0x0, 0x0)
        /media/genius/OS/project/src/github.com/pingcap/tidb/ddl/ddl.go:296 +0x95
github.com/pingcap/tidb/domain.(*Domain).Close(0xc0005d2900)
        /media/genius/OS/project/src/github.com/pingcap/tidb/domain/domain.go:608 +0x2bf
main.closeDomainAndStorage()
        /media/genius/OS/project/src/github.com/pingcap/tidb/tidb-server/main.go:718 +0x3f
main.cleanup()
        /media/genius/OS/project/src/github.com/pingcap/tidb/tidb-server/main.go:730 +0x80
main.main()
        /media/genius/OS/project/src/github.com/pingcap/tidb/tidb-server/main.go:188 +0x1f1

It is block on domain.Close, and waiting for ownerManager to exit.
However, ownerManager is doing its CampaignOwner loop and it seems this loop never end ...

go.etcd.io/etcd/clientv3.(*txn).Commit(0xc000bff4d0, 0x0, 0x0, 0x0)
        /media/genius/OS/project/pkg/mod/go.etcd.io/etcd@v0.5.0-alpha.5.0.20191023171146-3cf2f69b5738/clientv3/txn.go:146 +0x16f
go.etcd.io/etcd/clientv3/concurrency.(*Election).Resign(0xc000690540, 0x3adc4a0, 0xc0004d3f00, 0xc000397fe0, 0x13)
        /media/genius/OS/project/pkg/mod/go.etcd.io/etcd@v0.5.0-alpha.5.0.20191023171146-3cf2f69b5738/clientv3/concurrency/election.go:138 +0x3b6
go.etcd.io/etcd/clientv3/concurrency.(*Election).Campaign(0xc000690540, 0x3adc4a0, 0xc000690500, 0xc000052c90, 0x24, 0xc000397fe0, 0x13)
        /media/genius/OS/project/pkg/mod/go.etcd.io/etcd@v0.5.0-alpha.5.0.20191023171146-3cf2f69b5738/clientv3/concurrency/election.go:98 +0x798
github.com/pingcap/tidb/owner.(*ownerManager).campaignLoop(0xc000479830, 0xc000678570)
        /media/genius/OS/project/src/github.com/pingcap/tidb/owner/manager.go:274 +0x6fe
created by github.com/pingcap/tidb/owner.(*ownerManager).CampaignOwner
        /media/genius/OS/project/src/github.com/pingcap/tidb/owner/manager.go:194 +0x343

4. Affected version (Required)

master f31298f

5. Root Cause Analysis

The text was updated successfully, but these errors were encountered:

zimulala · 2020-07-07T01:56:28Z

PTAL @AilinKid

wjhuang2016 · 2020-09-01T07:38:01Z

@AilinKid Any update?

morgo · 2020-12-07T05:21:48Z

I think this is a duplicate of #10260 - please close both if you fix it :-)

AilinKid · 2020-12-08T11:37:44Z

I've investigated for a while.
1: the domain got some work to do before close TiDB, such as clean some registered information in PD, the ETCD campaign needs pdclient.done() signal to break the loop, However, the PD client is closed in the store layer which is after the domain close action. Putting it ahead seems valid but not that elegant because skipping clean is not a good manner.

2: Besides, there many sub-goroutines that need oracle ts to push forward their job, while the lost PD will cause the RPC request to fail, the backoff ctx (constructed with context.Background ) is the only channel that can break their loop.

Seems hard to find an elegant way to exit it without PD...

AilinKid · 2020-12-11T05:03:17Z

Notice: After trail and error, we found it occurs in the master, v4.0, v3.0... (maybe we always got this phenomenon from every beginning)

we tried:

func (do *Domain) Close() {
	if do == nil {
		return
	}
	startTime := time.Now()
	
        // if you are sure the linked pb is dead, you can put etcdcli close action forward here. 
        // because the campaign loop /  ddl owner checker ... will use this client to send 
        // request, which will stuck in the block, unless you close the context in the etcdcli
        // actively.
        // After you clicking the ctrl+c in the TiDB, you will still need to wait about 3 minutes
        // because these background goutinues have some retry counts and try to pull it up.
        // Then, the domain will close, then store will close, finally TiDB will close.

	if do.etcdClient != nil {
		terror.Log(errors.Trace(do.etcdClient.Close()))
	}

	if do.ddl != nil {
		terror.Log(do.ddl.Stop())
	}

AilinKid · 2020-12-11T05:09:49Z

But letting etcdcli close firstly is not always a good choice, because you can not tell whether it is caused by PD death. If the PD is good, this code will ignore much information cleaning in the ETCD, and stats related info won't be stored in the TiKV, and ...

Judging whether PD is dead is subjective to your mind, you can send PD request for testing. However, the connection refused can also be caused by unstable network isolation.

AilinKid · 2020-12-11T05:13:11Z

So I think, for a subjective extreme case (PD is dead), letting the user close TiDB forcibly by kill -9 PID is much understandable. The lost info in the TiDB instance is also expected for the user.

AilinKid · 2020-12-11T05:18:42Z

PTAL @morgo @bb7133 @wjhuang2016 @tiancaiamao

morgo · 2020-12-11T05:37:32Z

The use case I was thinking about is that the pd-server is dead because of an incorrect shutdown order. i.e. you want to shutdown all components, but pd-server shuts down first. tidb-server and tikv-server already correctly handle the case of incorrect startup order (they will spin waiting for pd-server).

In a distributed system it is hard to ensure order, so its nice if shutdown can have the same properties as startup.

AilinKid · 2020-12-11T06:13:49Z

Make sense, maybe we can change this issue as a feature request and try to figure out an elegant way to do this.

bb7133 · 2021-12-07T14:01:20Z

I decide to make a feature request instead of a bug.

tiancaiamao added type/bug The issue is confirmed as a bug. sig/sql-infra SIG: SQL Infra labels Jul 2, 2020

zimulala added the status/TODO label Jul 3, 2020

tiancaiamao mentioned this issue Jul 13, 2020

worker aggressively output error log pingcap/dm#792

Closed

sre-bot added the severity/moderate label Aug 10, 2020

ichn-hu mentioned this issue Dec 4, 2020

Welcome to contribute #20804

Closed

AilinKid self-assigned this Dec 7, 2020

ti-chi-bot mentioned this issue Jun 22, 2021

executor: Let the SHUTDOWN statement do a graceful shutdown #25673

Merged

tisonkun removed the status/TODO label Sep 1, 2021

xiongjiwei mentioned this issue Dec 7, 2021

tidb-server does not shutdown gracefully if pd-server and/or tikv-server have gone away #10260

Closed

bb7133 changed the title ~~Close PD first and tidb-server fail to exit~~ Make TiDB server shutdown gracefully when PD is dead Dec 7, 2021

bb7133 added type/feature-request Categorizes issue or PR as related to a new feature. and removed type/bug The issue is confirmed as a bug. severity/moderate labels Dec 7, 2021

bb7133 unassigned AilinKid Dec 7, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make TiDB server shutdown gracefully when PD is dead #18336

Make TiDB server shutdown gracefully when PD is dead #18336

tiancaiamao commented Jul 2, 2020

zimulala commented Jul 7, 2020

wjhuang2016 commented Sep 1, 2020

morgo commented Dec 7, 2020

AilinKid commented Dec 8, 2020 •

edited

Loading

AilinKid commented Dec 11, 2020 •

edited

Loading

AilinKid commented Dec 11, 2020 •

edited

Loading

AilinKid commented Dec 11, 2020 •

edited

Loading

AilinKid commented Dec 11, 2020 •

edited

Loading

morgo commented Dec 11, 2020 •

edited

Loading

AilinKid commented Dec 11, 2020

bb7133 commented Dec 7, 2021

Make TiDB server shutdown gracefully when PD is dead #18336

Make TiDB server shutdown gracefully when PD is dead #18336

Comments

tiancaiamao commented Jul 2, 2020

Bug Report

1. Minimal reproduce step (Required)

2. What did you expect to see? (Required)

3. What did you see instead (Required)

4. Affected version (Required)

5. Root Cause Analysis

zimulala commented Jul 7, 2020

wjhuang2016 commented Sep 1, 2020

morgo commented Dec 7, 2020

AilinKid commented Dec 8, 2020 • edited Loading

AilinKid commented Dec 11, 2020 • edited Loading

Notice: After trail and error, we found it occurs in the master, v4.0, v3.0... (maybe we always got this phenomenon from every beginning)

AilinKid commented Dec 11, 2020 • edited Loading

AilinKid commented Dec 11, 2020 • edited Loading

AilinKid commented Dec 11, 2020 • edited Loading

morgo commented Dec 11, 2020 • edited Loading

AilinKid commented Dec 11, 2020

bb7133 commented Dec 7, 2021

AilinKid commented Dec 8, 2020 •

edited

Loading

AilinKid commented Dec 11, 2020 •

edited

Loading

AilinKid commented Dec 11, 2020 •

edited

Loading

AilinKid commented Dec 11, 2020 •

edited

Loading

AilinKid commented Dec 11, 2020 •

edited

Loading

morgo commented Dec 11, 2020 •

edited

Loading