Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make TiDB server shutdown gracefully when PD is dead #18336

Open
tiancaiamao opened this issue Jul 2, 2020 · 11 comments
Open

Make TiDB server shutdown gracefully when PD is dead #18336

tiancaiamao opened this issue Jul 2, 2020 · 11 comments
Labels
sig/sql-infra SIG: SQL Infra type/feature-request Categorizes issue or PR as related to a new feature.

Comments

@tiancaiamao
Copy link
Contributor

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

Run a cluster, kill pd, then kill tidb-server (Ctrl - C)

2. What did you expect to see? (Required)

tidb-server exit

3. What did you see instead (Required)

The process print a log of error log and never exit.

kill -USR1 pid to get the goroutine stack:

goroutine 1 [semacquire, 17 minutes]:
sync.runtime_Semacquire(0xc0004798b0)
        /media/genius/OS/project/go/src/runtime/sema.go:56 +0x42
sync.(*WaitGroup).Wait(0xc0004798a8)
        /media/genius/OS/project/go/src/sync/waitgroup.go:130 +0x64
github.com/pingcap/tidb/owner.(*ownerManager).Cancel(0xc000479830)
        /media/genius/OS/project/src/github.com/pingcap/tidb/owner/manager.go:121 +0x3d
github.com/pingcap/tidb/ddl.(*ddl).close(0xc0001b7f20)
        /media/genius/OS/project/src/github.com/pingcap/tidb/ddl/ddl.go:365 +0xbc
github.com/pingcap/tidb/ddl.(*ddl).Stop(0xc0001b7f20, 0x0, 0x0)
        /media/genius/OS/project/src/github.com/pingcap/tidb/ddl/ddl.go:296 +0x95
github.com/pingcap/tidb/domain.(*Domain).Close(0xc0005d2900)
        /media/genius/OS/project/src/github.com/pingcap/tidb/domain/domain.go:608 +0x2bf
main.closeDomainAndStorage()
        /media/genius/OS/project/src/github.com/pingcap/tidb/tidb-server/main.go:718 +0x3f
main.cleanup()
        /media/genius/OS/project/src/github.com/pingcap/tidb/tidb-server/main.go:730 +0x80
main.main()
        /media/genius/OS/project/src/github.com/pingcap/tidb/tidb-server/main.go:188 +0x1f1

It is block on domain.Close, and waiting for ownerManager to exit.
However, ownerManager is doing its CampaignOwner loop and it seems this loop never end ...

go.etcd.io/etcd/clientv3.(*txn).Commit(0xc000bff4d0, 0x0, 0x0, 0x0)
        /media/genius/OS/project/pkg/mod/go.etcd.io/etcd@v0.5.0-alpha.5.0.20191023171146-3cf2f69b5738/clientv3/txn.go:146 +0x16f
go.etcd.io/etcd/clientv3/concurrency.(*Election).Resign(0xc000690540, 0x3adc4a0, 0xc0004d3f00, 0xc000397fe0, 0x13)
        /media/genius/OS/project/pkg/mod/go.etcd.io/etcd@v0.5.0-alpha.5.0.20191023171146-3cf2f69b5738/clientv3/concurrency/election.go:138 +0x3b6
go.etcd.io/etcd/clientv3/concurrency.(*Election).Campaign(0xc000690540, 0x3adc4a0, 0xc000690500, 0xc000052c90, 0x24, 0xc000397fe0, 0x13)
        /media/genius/OS/project/pkg/mod/go.etcd.io/etcd@v0.5.0-alpha.5.0.20191023171146-3cf2f69b5738/clientv3/concurrency/election.go:98 +0x798
github.com/pingcap/tidb/owner.(*ownerManager).campaignLoop(0xc000479830, 0xc000678570)
        /media/genius/OS/project/src/github.com/pingcap/tidb/owner/manager.go:274 +0x6fe
created by github.com/pingcap/tidb/owner.(*ownerManager).CampaignOwner
        /media/genius/OS/project/src/github.com/pingcap/tidb/owner/manager.go:194 +0x343

4. Affected version (Required)

master f31298f

5. Root Cause Analysis

@tiancaiamao tiancaiamao added type/bug The issue is confirmed as a bug. sig/sql-infra SIG: SQL Infra labels Jul 2, 2020
@zimulala
Copy link
Contributor

zimulala commented Jul 7, 2020

PTAL @AilinKid

@wjhuang2016
Copy link
Member

@AilinKid Any update?

@AilinKid AilinKid self-assigned this Dec 7, 2020
@morgo
Copy link
Contributor

morgo commented Dec 7, 2020

I think this is a duplicate of #10260 - please close both if you fix it :-)

@AilinKid
Copy link
Contributor

AilinKid commented Dec 8, 2020

I've investigated for a while.
1: the domain got some work to do before close TiDB, such as clean some registered information in PD, the ETCD campaign needs pdclient.done() signal to break the loop, However, the PD client is closed in the store layer which is after the domain close action. Putting it ahead seems valid but not that elegant because skipping clean is not a good manner.

2: Besides, there many sub-goroutines that need oracle ts to push forward their job, while the lost PD will cause the RPC request to fail, the backoff ctx (constructed with context.Background ) is the only channel that can break their loop.

Seems hard to find an elegant way to exit it without PD...

@AilinKid
Copy link
Contributor

AilinKid commented Dec 11, 2020

Notice: After trail and error, we found it occurs in the master, v4.0, v3.0... (maybe we always got this phenomenon from every beginning)

we tried:

func (do *Domain) Close() {
	if do == nil {
		return
	}
	startTime := time.Now()
	
        // if you are sure the linked pb is dead, you can put etcdcli close action forward here. 
        // because the campaign loop /  ddl owner checker ... will use this client to send 
        // request, which will stuck in the block, unless you close the context in the etcdcli
        // actively.
        // After you clicking the ctrl+c in the TiDB, you will still need to wait about 3 minutes
        // because these background goutinues have some retry counts and try to pull it up.
        // Then, the domain will close, then store will close, finally TiDB will close.

	if do.etcdClient != nil {
		terror.Log(errors.Trace(do.etcdClient.Close()))
	}

	if do.ddl != nil {
		terror.Log(do.ddl.Stop())
	}

@AilinKid
Copy link
Contributor

AilinKid commented Dec 11, 2020

But letting etcdcli close firstly is not always a good choice, because you can not tell whether it is caused by PD death. If the PD is good, this code will ignore much information cleaning in the ETCD, and stats related info won't be stored in the TiKV, and ...

Judging whether PD is dead is subjective to your mind, you can send PD request for testing. However, the connection refused can also be caused by unstable network isolation.

@AilinKid
Copy link
Contributor

AilinKid commented Dec 11, 2020

So I think, for a subjective extreme case (PD is dead), letting the user close TiDB forcibly by kill -9 PID is much understandable. The lost info in the TiDB instance is also expected for the user.

@AilinKid
Copy link
Contributor

AilinKid commented Dec 11, 2020

@morgo
Copy link
Contributor

morgo commented Dec 11, 2020

The use case I was thinking about is that the pd-server is dead because of an incorrect shutdown order. i.e. you want to shutdown all components, but pd-server shuts down first. tidb-server and tikv-server already correctly handle the case of incorrect startup order (they will spin waiting for pd-server).

In a distributed system it is hard to ensure order, so its nice if shutdown can have the same properties as startup.

@AilinKid
Copy link
Contributor

Make sense, maybe we can change this issue as a feature request and try to figure out an elegant way to do this.

@bb7133
Copy link
Member

bb7133 commented Dec 7, 2021

I decide to make a feature request instead of a bug.

@bb7133 bb7133 changed the title Close PD first and tidb-server fail to exit Make TiDB server shutdown gracefully when PD is dead Dec 7, 2021
@bb7133 bb7133 added type/feature-request Categorizes issue or PR as related to a new feature. and removed type/bug The issue is confirmed as a bug. severity/moderate labels Dec 7, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
sig/sql-infra SIG: SQL Infra type/feature-request Categorizes issue or PR as related to a new feature.
Projects
None yet
Development

No branches or pull requests

8 participants