
Acquiring project from boskos failing for scalability oss jobs #14697

Closed
mm4tt opened this issue Oct 10, 2019 · 42 comments

@mm4tt
Contributor

mm4tt commented Oct 10, 2019

Since yesterday, all scalability jobs using boskos have been failing with this error:

main.go:319: Something went wrong: failed to prepare test environment: --provider=gce boskos failed to acquire project: resources not found

This has happened over 10 times since yesterday and is causing release-blocking jobs to fail consistently, e.g. https://k8s-testgrid.appspot.com/sig-scalability-gce#gce-cos-master-scalability-100

I would appreciate it if someone with access to the boskos logs could check why our requests for project-type=scalability-project are failing with "resources not found".

/priority critical-urgent
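
For context, the kubetest boskos client essentially issues an HTTP acquire against the boskos service when the job starts. A rough sketch of the equivalent request is below; the in-cluster service address, endpoint, and query parameter names are from memory of the boskos API and should be treated as assumptions, and the owner value is just an example job name.

$ BOSKOS=http://boskos.test-pods.svc.cluster.local
$ # Ask boskos for a free scalability-project and move it to busy for this job (sketch, not verified against this deployment).
$ curl -X POST "${BOSKOS}/acquire?type=scalability-project&state=free&dest=busy&owner=ci-kubernetes-e2e-gce-scale-performance"

If boskos answers that kind of request with "resources not found", that suggests it does not know about any resource of that type at all, as opposed to all of them being busy.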

@mm4tt mm4tt added the kind/bug Categorizes issue or PR as related to a bug. label Oct 10, 2019
@k8s-ci-robot k8s-ci-robot added the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Oct 10, 2019
mm4tt added a commit to mm4tt/test-infra that referenced this issue Oct 10, 2019
This reverts commit edc776c.

Boskos is broken due to kubernetes#14697

Will re-enable once it gets fixed.

Ref. kubernetes/perf-tests#650
@k8s-ci-robot
Contributor

@hasheddan: The provided milestone is not valid for this repository. Milestones in this repository: [2020-goals, someday, v1.16, v1.17, v1.18]

Use /milestone clear to clear the milestone.

In response to this:

/milestone 1.17

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@hasheddan
Contributor

/milestone v1.17

@k8s-ci-robot k8s-ci-robot added this to the v1.17 milestone Oct 10, 2019
@fejta
Contributor

fejta commented Oct 10, 2019

/kind oncall-hotlist

/assign @sebastienvas @krzyzacy

Is this expected?

$ kubectl --context=prow-builds logs -l app=boskos -n test-pods

{"error":"Resource gpu-project not exist","level":"error","msg":"No available resource","time":"2019-10-10T16:34:42Z"}
{"error":"Resource gce-project not exist","level":"error","msg":"No available resource","time":"2019-10-10T16:34:43Z"}
{"level":"info","msg":"Updated resource k8s-jkns-e2e-gke-ci-canary","time":"2019-10-10T16:34:43Z"}
{"handler":"handleUpdate","level":"info","msg":"From 10.60.188.228:40078","time":"2019-10-10T16:34:43Z"}
{"handler":"handleStart","level":"info","msg":"From 10.61.17.230:54462","time":"2019-10-10T16:34:44Z"}
{"level":"info","msg":"Request for a free gce-project from ci-kubernetes-e2e-gci-gce-slow, dest busy","time":"2019-10-10T16:34:44Z"}
{"level":"info","msg":"Updated resource gce-up-c1-3-g1-4-up-mas","time":"2019-10-10T16:34:44Z"}
{"handler":"handleStart","level":"info","msg":"From 10.60.53.80:52630","time":"2019-10-10T16:34:44Z"}
{"level":"info","msg":"Request for a free gce-project from ci-kubernetes-e2e-autoscaling-vpa-actuation, dest busy","time":"2019-10-10T16:34:44Z"}
{"error":"Resource gce-project not exist","level":"error","msg":"No available resource","time":"2019-10-10T16:34:44Z"}

$ kubectl --context=prow-builds logs -l app=boskos-janitor -n test-pods

{"error":"no resource name kube-gke-upg-1-2-1-3-upg-clu-n","level":"warning","msg":"Update kube-gke-upg-1-2-1-3-upg-clu-n failed","time":"2019-10-10T16:34:59Z"}
{"error":"no resource name gke-up-c1-4-clat-up-mas","level":"warning","msg":"Update gke-up-c1-4-clat-up-mas failed","time":"2019-10-10T16:34:59Z"}
{"error":"no resource name gke-up-g1-2-c1-3-up-clu","level":"warning","msg":"Update gke-up-g1-2-c1-3-up-clu failed","time":"2019-10-10T16:34:59Z"}
{"error":"no resource name k8s-gci-gke-slow-1-5","level":"warning","msg":"Update k8s-gci-gke-slow-1-5 failed","time":"2019-10-10T16:34:59Z"}

@k8s-ci-robot k8s-ci-robot added the kind/oncall-hotlist Categorizes issue or PR as tracked by test-infra oncall. label Oct 10, 2019
@fejta
Contributor

fejta commented Oct 10, 2019

@fejta
Contributor

fejta commented Oct 10, 2019

What do these errors mean?

{
 insertId: "ld746eg13kr670"  
 jsonPayload: {
  error: "no resource name k8s-jkns-gci-gke-serial-1-4"   
  level: "warning"   
  msg: "Update k8s-jkns-gci-gke-serial-1-4 failed"   
 }
 labels: {…}  
 logName: "projects/k8s-prow-builds/logs/boskos-janitor"  
 receiveTimestamp: "2019-10-10T18:26:27.814170080Z"  
 resource: {
  labels: {
   cluster_name: "prow"    
   container_name: "boskos-janitor"    
   instance_id: "8529070823101903748"    
   namespace_id: "test-pods"    
   pod_id: "boskos-janitor-7b85969d5c-r45pj"    
   project_id: "k8s-prow-builds"    
   zone: "us-central1-f"    
  }
  type: "container"   
 }
 severity: "ERROR"  
 timestamp: "2019-10-10T18:26:24Z"  
}

@fejta
Contributor

fejta commented Oct 10, 2019

https://github.com/kubernetes/test-infra/commits/master/boskos last change was on Oct 4, which was part of the 1004 bump: #14598
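
For reference, the equivalent check from a local kubernetes/test-infra checkout is roughly:

$ git log --oneline --since=2019-10-01 -- boskos/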

@fejta
Contributor

fejta commented Oct 10, 2019

#14598 (comment) merged at Oct 4, 11:14a PST

@fejta
Contributor

fejta commented Oct 10, 2019

@BenTheElder
Member

http://velodrome.k8s.io/dashboard/db/boskos-dashboard?orgId=1&from=now-2d&to=now
The non-GKE projects look happier now, but GKE dirty still looks to be climbing. Do we need to do #14708 for GKE as well?

@BenTheElder
Member

Per @krzyzacy, it seems cleaning up the new resource types (routers) is a lot slower. Bumping the janitor replicas seems to have worked (ref #14708 by @fejta); I've sent a bump for the other GCP project (GKE project) janitor in #14714.
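
For anyone hitting this later, the manual equivalent of that mitigation is roughly the following; the deployment name and replica count here are assumptions, and the actual bump in #14708/#14714 was made in the checked-in config rather than by hand:

$ kubectl --context=prow-builds -n test-pods scale deployment/boskos-janitor --replicas=10
$ kubectl --context=prow-builds -n test-pods get pods -l app=boskos-janitor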

@BenTheElder
Member

BenTheElder commented Oct 10, 2019

GKE dirty projects have dramatically dropped from 145 to 3. I think we're good for now, but we'll probably still want #14715.

@fejta fejta closed this as completed Oct 10, 2019
@fejta
Contributor

fejta commented Oct 10, 2019

More work to do, but as Ben says, the outage is resolved.

mm4tt added a commit to mm4tt/test-infra that referenced this issue Oct 11, 2019
Boskos issues were fixed in
kubernetes#14697.
Let's try re-enabling.

Ref. kubernetes/perf-tests#650
@k8s-ci-robot
Contributor

@mm4tt: Closing this issue.

In response to this:

Lost track of this one; we added more projects to the presubmit pool a long time ago. Both our pools seem healthy as of now.
[screenshot]

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@BenTheElder
Member

BenTheElder commented Feb 18, 2020

https://monitoring.prow.k8s.io/d/wSrfvNxWz/boskos-resource-usage?orgId=1

[screenshot]

this does not look healthy to me.

@BenTheElder BenTheElder reopened this Feb 18, 2020
@BenTheElder BenTheElder removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 18, 2020
@fejta
Contributor

fejta commented Feb 18, 2020

Some potential commits

commit d27cc13
Merge: 2c6e37d 7c0ac78
Author: Kubernetes Prow Robot <k8s-ci-robot@users.noreply.github.com>
Date: Tue Feb 18 03:03:29 2020 -0800

Merge pull request #16335 from mm4tt/gce-100-containerd

Add experimental gce-100-perf job with containerd enabled

commit 7c0ac78
Author: Matt Matejczyk <mmat@google.com>
Date: Tue Feb 18 11:20:14 2020 +0100

Add experimental gce-100-perf job with containerd

commit 08bbe16
Author: Jakub Przychodzeń <jprzychodzen@google.com>
Date: Tue Feb 18 11:12:46 2020 +0100

[Access-tokens] Ramp up access-tokens test

commit 73cc13b
Merge: 51ad7a6 dfcf87c
Author: Kubernetes Prow Robot <k8s-ci-robot@users.noreply.github.com>
Date: Mon Feb 17 03:45:28 2020 -0800

Merge pull request #16248 from jkaniuk/perf-tests-presubmits

Add presubmits for perf-tests release branches

commit 51ad7a6
Merge: 8654545 5e3b4be
Author: Kubernetes Prow Robot <k8s-ci-robot@users.noreply.github.com>
Date: Mon Feb 17 01:03:29 2020 -0800

Merge pull request #16297 from jprzychodzen/at-roll-2

[Access-tokens] Rollout access-tokens test to seconds set of jobs

@mm4tt
Contributor Author

mm4tt commented Feb 19, 2020

We have two pools, one for continuous jobs and another for presubmits.

[screenshot]

The continuous one seems pretty healthy, but the presubmit one looks even worse now.
Only one of the commits you listed touched presubmit jobs; I'll take a look.

In the meantime, @jprzychodzen will add more projects to the presubmit pool to stop the bleeding.

@mm4tt
Contributor Author

mm4tt commented Feb 19, 2020

Something seems off: we have 40 projects in the pool, and as of now almost 100% of them seem to be busy.
But when I tried counting all running presubmit jobs, I couldn't find more than 15:

https://prow.k8s.io/?type=presubmit&job=pull-kubernetes*performance*&state=pending
[screenshot]
https://prow.k8s.io/?type=presubmit&job=pull-kubernetes*kubemark*&state=pending
[screenshot]
https://prow.k8s.io/?type=presubmit&job=*clusterloader2*&state=pending
[screenshot]

Is it possible that boskos is somehow leaking the projects? Do we have any logs/metrics from boskos for our presubmit pool?
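
One way to cross-check prow's view against boskos' is to count pending ProwJobs directly; the kubectl context below and the assumption that the ProwJob CRD is reachable from it are guesses about this deployment:

$ kubectl --context=prow-builds get prowjobs \
    -o jsonpath='{range .items[?(@.status.state=="pending")]}{.spec.job}{"\n"}{end}' \
    | grep -cE 'performance|kubemark|clusterloader2'

If that count stays well below the number of busy projects boskos reports, the projects are probably leaking or stuck in cleaning rather than genuinely in use.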

@mm4tt
Contributor Author

mm4tt commented Feb 19, 2020

/cc @mborsz - scalability oncall
@fejta (test-infra) oncall is already assigned

@mborsz
Member

mborsz commented Feb 19, 2020

This can be caused by #16338

@mm4tt
Contributor Author

mm4tt commented Feb 19, 2020

The perf-tests release-branch presubmits were added two days ago in #16248.
I don't see anything wrong with them, but it looks like they are causing the problems, so maybe we should consider reverting that PR to stop the bleeding.

@jprzychodzen
Contributor

Adding new projects in #16358

@mm4tt
Contributor Author

mm4tt commented Feb 19, 2020

To summarize the findings from #16338: our understanding is that we have presubmit jobs for which prow is spawning pods continuously, and those pods in turn eat up all the projects in the presubmit boskos pool. We'll try disabling these presubmits as a mitigation to unblock the k/k presubmits (currently all k/k PRs are blocked by this).

@mm4tt
Contributor Author

mm4tt commented Feb 20, 2020

Looks like our presubmit pool has recovered:

[screenshot]

One more thing before we close this bug: we added 5 new projects to the pool in #16358, but most likely some boskos config push or reload needs to happen before they are actually picked up.

@fejta, is this something you could help us with?
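
If a manual refresh turns out to be needed, something along these lines should do it; the configmap name, key, and file path are assumptions about how this deployment mounts the boskos config, and boskos reloading the file without a restart is also an assumption:

$ kubectl --context=prow-builds -n test-pods create configmap resources \
    --from-file=config=boskos/resources.yaml --dry-run=client -o yaml \
    | kubectl --context=prow-builds -n test-pods apply -f -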

@ixdy
Member

ixdy commented Feb 20, 2020

@mm4tt there's a job that automatically updates the boskos configmap, but it looks like it's misconfigured:

Another change in boskos/ just merged, so the config should be updated momentarily, but I'll fix the trigger for that job, too.

@ixdy
Member

ixdy commented Feb 20, 2020

fix for the config updater job: #16395

@mm4tt
Contributor Author

mm4tt commented Feb 21, 2020

Thanks @ixdy for taking a look and fixing the updater job.
I can see that our presubmit pool has been updated with the 5 extra projects.
We can close this one.

/close
