Make TiDB server shutdown gracefully when PD is dead #18336
Comments
PTAL @AilinKid |
@AilinKid Any update? |
I think this is a duplicate of #10260 - please close both if you fix it :-) |
I've investigated for a while. 2: Besides, there are many sub-goroutines that need an oracle ts to push their jobs forward. Losing PD causes their RPC requests to fail, and the backoff ctx (constructed with context.Background) is the only channel that can break their loops. It seems hard to find an elegant way to exit without PD... |
Notice: After trial and error, we found that it occurs on master, v4.0, v3.0... (maybe this behavior has been there from the very beginning). We tried:
|
But letting etcdcli close first is not always a good choice, because you cannot tell whether the failure is caused by PD being dead. If PD is healthy, this code will skip much of the information cleanup in etcd, and stats-related info won't be stored in TiKV, and ... Judging whether PD is dead is subjective; you can send a PD request to test it. However, a connection refused error can also be caused by an unstable network or network isolation. |
So I think, for a subjective extreme case (PD is dead), letting the user close TiDB forcibly by |
The use case I was thinking about is that the In a distributed system it is hard to ensure order, so it's nice if shutdown can have the same properties as startup. |
Makes sense, maybe we can change this issue to a feature request and try to figure out an elegant way to do this. |
I've decided to make this a feature request instead of a bug. |
Bug Report
Please answer these questions before submitting your issue. Thanks!
1. Minimal reproduce step (Required)
Run a cluster, kill PD, then kill tidb-server (Ctrl-C)
2. What did you expect to see? (Required)
tidb-server exits
3. What did you see instead (Required)
The process prints a lot of error logs and never exits.
Use kill -USR1 pid to get the goroutine stack: it is blocked on domain.Close, waiting for ownerManager to exit.
However, ownerManager is running its CampaignOwner loop, and it seems this loop never ends ...
4. Affected version (Required)
master f31298f
5. Root Cause Analysis