tidb-scheduler stuck filtering #468
Comments
The TiDB Operator can emit events now: #427
The latest operator logs this message (https://github.com/pingcap/tidb-operator/blob/master/pkg/scheduler/scheduler.go#L78), so you may be running an old version of the operator.
We will release the |
The clusters are destroyed now. Originally I meant to destroy just one, but destroying one fixed the other. I will try reproducing this with the new operator, but I believe the issue is still present in the new code: when provisioning two clusters, both can get stuck because of a problematic node, even though at least one of them should succeed.
Does this problem still exist in the |
@weekface I can confirm that in beta3 a scheduling issue in one cluster will block a second cluster from being scheduled.
Same as #602, closing this.
operator version: the latest stable release
@jlerche started two clusters at once. Both ended up stuck with a failed PD member; one also had 3 failed TiDB members [1], and the other had its TiKV pods stuck in Pending.
The scheduler logs showed a lot of errors. I don't remember seeing so many errors in kube-scheduler before [2]. The tidb-scheduler logs were a lot simpler [3]. I noticed error-level logging there. Looking at the source code, an error is logged when a 500 is returned, so we have the error from the kube-scheduler 500.
I would have said that this was just a network issue. However, when we deleted the second cluster, the first cluster immediately finished coming up successfully.
So my first question is: can the failure to schedule one cluster interfere with another being scheduled? I think the answer is yes, in that a problem node could also be used by another cluster.
It seems like I should be seeing more information in the logs and maybe some events?
We have destroyed the clusters now, but are willing to try to reproduce the issue again.
[1] failed members
[2] kube-scheduler errors
[3] tidb-scheduler logs