Acquiring project from boskos failing for scalability oss jobs #14697
This reverts commit edc776c. Boskos is broken due to kubernetes#14697. Will re-enable once it gets fixed. Ref. kubernetes/perf-tests#650
@hasheddan: The provided milestone is not valid for this repository. Milestones in this repository: [...]

In response to this: [...]

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

/milestone v1.17
/kind oncall-hotlist
/assign @sebastienvas @krzyzacy

Is this expected?

$ kubectl --context=prow-builds logs -l app=boskos -n test-pods
{"error":"Resource gpu-project not exist","level":"error","msg":"No available resource","time":"2019-10-10T16:34:42Z"}
{"error":"Resource gce-project not exist","level":"error","msg":"No available resource","time":"2019-10-10T16:34:43Z"}
{"level":"info","msg":"Updated resource k8s-jkns-e2e-gke-ci-canary","time":"2019-10-10T16:34:43Z"}
{"handler":"handleUpdate","level":"info","msg":"From 10.60.188.228:40078","time":"2019-10-10T16:34:43Z"}
{"handler":"handleStart","level":"info","msg":"From 10.61.17.230:54462","time":"2019-10-10T16:34:44Z"}
{"level":"info","msg":"Request for a free gce-project from ci-kubernetes-e2e-gci-gce-slow, dest busy","time":"2019-10-10T16:34:44Z"}
{"level":"info","msg":"Updated resource gce-up-c1-3-g1-4-up-mas","time":"2019-10-10T16:34:44Z"}
{"handler":"handleStart","level":"info","msg":"From 10.60.53.80:52630","time":"2019-10-10T16:34:44Z"}
{"level":"info","msg":"Request for a free gce-project from ci-kubernetes-e2e-autoscaling-vpa-actuation, dest busy","time":"2019-10-10T16:34:44Z"}
{"error":"Resource gce-project not exist","level":"error","msg":"No available resource","time":"2019-10-10T16:34:44Z"}
$ kubectl --context=prow-builds logs -l app=boskos-janitor -n test-pods
{"error":"no resource name kube-gke-upg-1-2-1-3-upg-clu-n","level":"warning","msg":"Update kube-gke-upg-1-2-1-3-upg-clu-n failed","time":"2019-10-10T16:34:59Z"}
{"error":"no resource name gke-up-c1-4-clat-up-mas","level":"warning","msg":"Update gke-up-c1-4-clat-up-mas failed","time":"2019-10-10T16:34:59Z"}
{"error":"no resource name gke-up-g1-2-c1-3-up-clu","level":"warning","msg":"Update gke-up-g1-2-c1-3-up-clu failed","time":"2019-10-10T16:34:59Z"}
{"error":"no resource name k8s-gci-gke-slow-1-5","level":"warning","msg":"Update k8s-gci-gke-slow-1-5 failed","time":"2019-10-10T16:34:59Z"} |
What do these errors mean?

{
insertId: "ld746eg13kr670"
jsonPayload: {
error: "no resource name k8s-jkns-gci-gke-serial-1-4"
level: "warning"
msg: "Update k8s-jkns-gci-gke-serial-1-4 failed"
}
labels: {…}
logName: "projects/k8s-prow-builds/logs/boskos-janitor"
receiveTimestamp: "2019-10-10T18:26:27.814170080Z"
resource: {
labels: {
cluster_name: "prow"
container_name: "boskos-janitor"
instance_id: "8529070823101903748"
namespace_id: "test-pods"
pod_id: "boskos-janitor-7b85969d5c-r45pj"
project_id: "k8s-prow-builds"
zone: "us-central1-f"
}
type: "container"
}
severity: "ERROR"
timestamp: "2019-10-10T18:26:24Z"
}
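If it helps, these janitor entries can also be pulled straight from Stackdriver instead of via kubectl. A minimal sketch, assuming the filter fields match the entry above (adjust the filter and freshness as needed):

# Read recent boskos-janitor warnings from the k8s-prow-builds project.
$ gcloud logging read \
    'resource.labels.container_name="boskos-janitor" AND jsonPayload.level="warning"' \
    --project=k8s-prow-builds --freshness=1d --limit=50 --format=json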
https://github.com/kubernetes/test-infra/commits/master/boskos shows the last change was on Oct 4, which was part of the 1004 bump: #14598
#14598 (comment) merged on Oct 4 at 11:14a PST.
So it looks like we were missing a graph: http://velodrome.k8s.io/dashboard/db/boskos-dashboard?orgId=1&from=now-7d&to=now&panelId=8&fullscreen
http://velodrome.k8s.io/dashboard/db/boskos-dashboard?orgId=1&from=now-2d&to=now Non-GKE projects look happier now, but GKE dirty still looks to be climbing. Do we need to do #14708 for GKE as well?
GKE dirty projects have dramatically dropped from 145 to 3. I think we're good for now, but we'll probably want #14715.
More work to do, but as Ben says, the outage is resolved.
Boskos issues were fixed in kubernetes#14697. Let's try re-enabling. Ref. kubernetes/perf-tests#650
@mm4tt: Closing this issue.

In response to this: [...]

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
https://monitoring.prow.k8s.io/d/wSrfvNxWz/boskos-resource-usage?orgId=1: this does not look healthy to me.
Some potential commits:
commit d27cc13
commit 7c0ac78
commit 08bbe16
commit 73cc13b
commit 51ad7a6
We have two pools, one for continuous jobs and another for presubmits. The continuous one seems pretty healthy, but the presubmit one looks even worse now. In the meantime, @jprzychodzen will add more projects to the presubmit pool to stop the bleeding.
Something seems off: we have 40 projects in the pool, and as of now almost 100% seem to be busy. https://prow.k8s.io/?type=presubmit&job=pull-kubernetes*performance*&state=pending Is it possible that boskos is somehow leaking the projects? Do we have any logs/metrics from boskos for our presubmit pool?
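On the logs/metrics question: boskos has, as far as I remember, a /metric endpoint that reports how many resources of a type are in each state, which would show whether the presubmit pool is genuinely all busy or leaking into dirty. The service name, port, and resource type below are placeholders/assumptions, so treat this as a sketch:

# Port-forward to the boskos service and query per-state counts for a type.
$ kubectl --context=prow-builds -n test-pods port-forward svc/boskos 8080:80 &
$ curl -s "http://localhost:8080/metric?type=scalability-presubmit-project"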
This could be caused by #16338.
The perf-tests release-branch presubmits were added two days ago in #16248.
Adding new projects in #16358.
To summarize the findings from #16338: our understanding is that we have presubmit jobs for which prow is spawning pods continuously, and they in turn eat up all the projects in the presubmit boskos pool. We'll try disabling these presubmits as a mitigation to unblock the k/k presubmits (currently all k/k PRs are blocked by this).
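A rough way to quantify "prow is spawning pods continuously" is to count how many performance presubmit ProwJobs are pending at once, since each run can hold a project from the presubmit pool. This assumes ProwJob objects live in the default namespace of the prow service cluster; the context name is a placeholder:

# Count pending performance presubmit ProwJobs.
$ kubectl --context=<prow-service-cluster> get prowjobs -n default -o json \
    | jq '[.items[] | select(.spec.type=="presubmit" and .status.state=="pending" and (.spec.job | test("performance")))] | length'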
Fix for the config updater job: #16395
Thanks @ixdy for taking a look and fixing the updater job.

/close
Since yesterday, all scalability jobs using boskos have started failing with this error:
This has happened over 10 times since yesterday and is causing release-blocking jobs to fail consistently, e.g. https://k8s-testgrid.appspot.com/sig-scalability-gce#gce-cos-master-scalability-100
I would appreciate it if someone with access to boskos logs could check why our requests for project-type=scalability-project are failing with
resource not found
/priority critical-urgent