This repository has been archived by the owner on Jun 6, 2024. It is now read-only.
P0: Dependent objects are unexpectedly deleted #4117
yqwang-ms
changed the title
Dependent object are unexpected deleetd
Dependent objects are unexpected deleetd
Jan 8, 2020
yqwang-ms
changed the title
Dependent objects are unexpected deleetd
Dependent objects are unexpectedly deleted
Jan 8, 2020
yqwang-ms
changed the title
Dependent objects are unexpectedly deleted
P0: Dependent objects are unexpectedly deleted
Jan 8, 2020
abuccts
added a commit
that referenced
this issue
Jan 8, 2020
Remove framework owner reference for priority class, ref #4117.
This was referenced Jan 8, 2020
abuccts
added a commit
that referenced
this issue
Jan 9, 2020
Remove framework owner reference for priority class, ref #4117.
This was referenced Jan 9, 2020
abuccts
added a commit
that referenced
this issue
Jan 10, 2020
* add switch to enable priority class for job FIFO * remove framework owner reference for priority class, ref #4117
abuccts
added a commit
that referenced
this issue
Jan 10, 2020
Update job priority class * add switch to enable priority class for job FIFO * remove framework owner reference for priority class, ref #4117
@yqwang-ms, shall we close this issue?
All sub items done, close it
Short summary about the issue/question:
In PAI, on the V100LP and int beds, we have sometimes seen a job's priorityclass and configmap get unexpectedly deleted.
Root Cause Summary:
We made a namespace-scoped framework the owner of the cluster-scoped priorityclass. The behavior of such a cross-scope owner reference is undefined.
For the previous version v1.14.2 deployed by paictl, the behavior MAY be that the job hangs in deletion.
For the current version v1.15.6 deployed by kubespray, the behavior MAY be that the job's dependent objects are unexpectedly deleted.
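The scope rule behind this root cause can be illustrated with a minimal sketch. It assumes the rule that an owner reference carries no namespace of its own, so a dependent can only name an owner that is cluster-scoped or lives in the dependent's own namespace. The function name and the `None`-means-cluster-scoped convention are hypothetical and only for illustration.

```python
from typing import Optional


def owner_reference_in_scope(dependent_namespace: Optional[str],
                             owner_namespace: Optional[str]) -> bool:
    """Return True if the dependent can validly name this owner.

    A namespace of None stands for a cluster-scoped object
    (hypothetical convention used only in this sketch).
    """
    if owner_namespace is None:
        # A cluster-scoped owner is resolvable from any dependent.
        return True
    # A namespaced owner is only resolvable from the same namespace,
    # so a cluster-scoped dependent (None) can never name it.
    return dependent_namespace == owner_namespace
```

With this check, a cluster-scoped priorityclass owned by a namespaced framework is reported as out of scope, while a configmap in the framework's own namespace is in scope.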
Fix Plan:
We should not make the framework the owner of the priorityclass.
The above is tracked in Pure K8S Beta Release Plan - v0.17 #3872.
In the future (long term), we may need to replace the priorityclass with a queue sort plugin to achieve FIFO. However, the queue sort plugin is still an alpha feature.
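The short-term part of the fix plan (the priorityclass no longer names the framework as its owner) could look roughly like the sketch below. It works on a plain manifest dictionary. The helper name is hypothetical, and the `Framework` kind is an assumption that pairs with the API group seen in the ApiServer URLs of this issue. Without an owner reference the priorityclass is no longer garbage collected with the framework, so it presumably has to be deleted explicitly when the job is cleaned up.

```python
import copy

# API group/version taken from the ApiServer URLs quoted in this issue.
FRAMEWORK_API_VERSION = "frameworkcontroller.microsoft.com/v1"


def drop_framework_owner(manifest: dict) -> dict:
    """Return a copy of the manifest without Framework owner references.

    Hypothetical helper: the real change lives in the commits that
    reference this issue.
    """
    cleaned = copy.deepcopy(manifest)
    metadata = cleaned.get("metadata", {})
    kept = [
        ref for ref in metadata.get("ownerReferences", [])
        if not (ref.get("kind") == "Framework"
                and ref.get("apiVersion") == FRAMEWORK_API_VERSION)
    ]
    if kept:
        metadata["ownerReferences"] = kept
    else:
        # Drop the empty list so the manifest carries no owner at all.
        metadata.pop("ownerReferences", None)
    return cleaned
```

The input manifest is left untouched, which makes it easy to compare the object before and after the change.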
Root Cause Analysis:
The whole story is:
RestServer PATCHes the priorityclass's owner to be the framework.
GC controller checks whether the priorityclass's owner exists. Because GC assumes that the owner of a cluster-scoped priorityclass is also cluster-scoped, it checks ApiServer with the URL:
/apis/frameworkcontroller.microsoft.com/v1/frameworks/erpqjxb7f9m7wvk2cxu7crbc60rk0dv1bxtk4qv164t5ydhncgrg
This URL is definitely wrong, because it is cluster-scoped while the framework is namespace-scoped.
For the previous version v1.14.2 deployed by paictl, the ApiServer response is 400:
I1107 08:57:17.913195 1 wrap.go:47] GET /apis/frameworkcontroller.microsoft.com/v1/frameworks/f5mqjubyf5mqjuazehjq6x2zehgq6uuze9qprtazdhmq6x2z6r: (151.3µs) 400 [kube-controller-manager/v1.14.2 (linux/amd64) kubernetes/66049e3/generic-garbage-collector 10.151.41.16:49658]
For the current version v1.15.6 deployed by kubespray, the ApiServer response is 404:
I0108 09:20:02.289055 1 wrap.go:47] GET /apis/frameworkcontroller.microsoft.com/v1/frameworks/erpqjxb7f9m7wvk2cxu7crbc60rk0dv1bxtk4qv164t5ydhncgrg: (183.698µs) 404 [kube-controller-manager/v1.15.6 (linux/amd64) kubernetes/7015f71/system:serviceaccount:kube-system:generic-garbage-collector 10.8.1.7:34546]
The response changed because of PR "apiextensions: check request scope against CRD scope correctly" kubernetes/kubernetes#80750, which fixes "CVE-2019-11247: API server allows access to custom resources via wrong scope" kubernetes/kubernetes#80983.
GC controller treats 400 as an error, so it always retries, and we may see the job hang in deletion.
GC controller treats 404 as object not found, so it directly believes that the framework object (assume its UID is U1) does not exist, and deletes the priorityclass.
The "fact" believed by GC, that the U1 object does not exist, is cached. So afterwards, when GC controller starts to check the framework's configmap, it just uses the cache and believes that the configmap's owner does not exist either (even though it could now use the correct URL to check with ApiServer), and deletes the configmap.
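The sequence above can be replayed with a small toy model. This is not the real GC controller code. `ToyApiServer`, `ToyGC`, and the `default` namespace in the usage note are hypothetical, and only the URL shapes, the 400 versus 404 responses, and the cached "owner absent" belief come from the analysis above.

```python
from typing import Optional, Set, Tuple

FRAMEWORK_API = "/apis/frameworkcontroller.microsoft.com/v1"


def owner_url(name: str, namespace: Optional[str]) -> str:
    """Build the lookup path; namespace=None gives the cluster-scoped path."""
    if namespace is None:
        return f"{FRAMEWORK_API}/frameworks/{name}"
    return f"{FRAMEWORK_API}/namespaces/{namespace}/frameworks/{name}"


class ToyApiServer:
    """Answers framework lookups; wrong-scope requests get a fixed status."""

    def __init__(self, wrong_scope_status: int,
                 frameworks: Set[Tuple[str, str]]):
        self.wrong_scope_status = wrong_scope_status  # 400 or 404
        self.frameworks = frameworks  # set of (namespace, name)

    def get_status(self, name: str, namespace: Optional[str]) -> int:
        if namespace is None:
            # The framework is namespaced, so this request has the wrong scope.
            return self.wrong_scope_status
        return 200 if (namespace, name) in self.frameworks else 404


class ToyGC:
    """Decides per dependent: 'keep', 'delete', or 'retry'."""

    def __init__(self, api: ToyApiServer):
        self.api = api
        self.absent_owner_uids: Set[str] = set()  # cached "owner is gone"

    def process(self, dependent_namespace: Optional[str],
                owner_name: str, owner_uid: str) -> str:
        if owner_uid in self.absent_owner_uids:
            return "delete"  # cache hit, ApiServer is not asked again
        # The owner is looked up in the dependent's own scope.
        status = self.api.get_status(owner_name, dependent_namespace)
        if status == 404:
            self.absent_owner_uids.add(owner_uid)
            return "delete"
        if status == 200:
            return "keep"
        return "retry"  # e.g. 400 is treated as an error
```

For example, with `ToyApiServer(404, {("default", "fw")})`, processing the cluster-scoped priorityclass first (`dependent_namespace=None`) yields a delete and caches U1 as absent, so the configmap in `default` is deleted afterwards even though the framework still exists. With a wrong-scope status of 400 the priorityclass is only retried and the configmap is kept.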
The most time of the investigation is caused by
All Related Logs:
Job: https://v100lp.openp.ai/job-detail.html?username=v-yugzh&jobName=nbgtval0107a_s2_a12_65d1
framework name: erpqjxb7f9m7wvk2cxu7crbc60rk0dv1bxtk4qv164t5ydhncgrg
GC Controller v4 log:
ApiServer v4 log:
FC v2 log: