
Acquiring project from boskos failing for scalability oss jobs #14697

Closed
mm4tt opened this issue Oct 10, 2019 · 42 comments

@mm4tt
Contributor

mm4tt commented Oct 10, 2019

Since yesterday, all scalability jobs using boskos have been failing with this error:

main.go:319: Something went wrong: failed to prepare test environment: --provider=gce boskos failed to acquire project: resources not found

This has happened over 10 times since yesterday and is causing release-blocking jobs to fail consistently, e.g. https://k8s-testgrid.appspot.com/sig-scalability-gce#gce-cos-master-scalability-100

I would appreciate it if someone with access to the boskos logs could check why our requests for project-type=scalability-project are failing with "resources not found".

/priority critical-urgent
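
For context, the kubetest boskos client essentially issues an HTTP acquire against the boskos service when the job starts. A rough sketch of the equivalent request is below; the in-cluster service address, endpoint, and query parameter names are from memory of the boskos API and should be treated as assumptions, and the owner value is just an example job name.

$ BOSKOS=http://boskos.test-pods.svc.cluster.local
$ # Ask boskos for a free scalability-project and move it to busy for this job (sketch, not verified against this deployment).
$ curl -X POST "${BOSKOS}/acquire?type=scalability-project&state=free&dest=busy&owner=ci-kubernetes-e2e-gce-scale-performance"

If boskos answers that kind of request with "resources not found", that suggests it does not know about any resource of that type at all, as opposed to all of them being busy.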

@mm4tt mm4tt added the kind/bug Categorizes issue or PR as related to a bug. label Oct 10, 2019
@k8s-ci-robot k8s-ci-robot added the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Oct 10, 2019
mm4tt added a commit to mm4tt/test-infra that referenced this issue Oct 10, 2019
This reverts commit edc776c.

Boskos is broken due to kubernetes#14697

Will re-enable once it gets fixed.

Ref. kubernetes/perf-tests#650
@k8s-ci-robot
Contributor

@hasheddan: The provided milestone is not valid for this repository. Milestones in this repository: [2020-goals, someday, v1.16, v1.17, v1.18]

Use /milestone clear to clear the milestone.

In response to this:

/milestone 1.17

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@hasheddan
Contributor

/milestone v1.17

@k8s-ci-robot k8s-ci-robot added this to the v1.17 milestone Oct 10, 2019
@fejta
Contributor

fejta commented Oct 10, 2019

/kind oncall-hotlist

/assign @sebastienvas @krzyzacy

Is this expected?

$ kubectl --context=prow-builds logs -l app=boskos -n test-pods

{"error":"Resource gpu-project not exist","level":"error","msg":"No available resource","time":"2019-10-10T16:34:42Z"}
{"error":"Resource gce-project not exist","level":"error","msg":"No available resource","time":"2019-10-10T16:34:43Z"}
{"level":"info","msg":"Updated resource k8s-jkns-e2e-gke-ci-canary","time":"2019-10-10T16:34:43Z"}
{"handler":"handleUpdate","level":"info","msg":"From 10.60.188.228:40078","time":"2019-10-10T16:34:43Z"}
{"handler":"handleStart","level":"info","msg":"From 10.61.17.230:54462","time":"2019-10-10T16:34:44Z"}
{"level":"info","msg":"Request for a free gce-project from ci-kubernetes-e2e-gci-gce-slow, dest busy","time":"2019-10-10T16:34:44Z"}
{"level":"info","msg":"Updated resource gce-up-c1-3-g1-4-up-mas","time":"2019-10-10T16:34:44Z"}
{"handler":"handleStart","level":"info","msg":"From 10.60.53.80:52630","time":"2019-10-10T16:34:44Z"}
{"level":"info","msg":"Request for a free gce-project from ci-kubernetes-e2e-autoscaling-vpa-actuation, dest busy","time":"2019-10-10T16:34:44Z"}
{"error":"Resource gce-project not exist","level":"error","msg":"No available resource","time":"2019-10-10T16:34:44Z"}

$ kubectl --context=prow-builds logs -l app=boskos-janitor -n test-pods

{"error":"no resource name kube-gke-upg-1-2-1-3-upg-clu-n","level":"warning","msg":"Update kube-gke-upg-1-2-1-3-upg-clu-n failed","time":"2019-10-10T16:34:59Z"}
{"error":"no resource name gke-up-c1-4-clat-up-mas","level":"warning","msg":"Update gke-up-c1-4-clat-up-mas failed","time":"2019-10-10T16:34:59Z"}
{"error":"no resource name gke-up-g1-2-c1-3-up-clu","level":"warning","msg":"Update gke-up-g1-2-c1-3-up-clu failed","time":"2019-10-10T16:34:59Z"}
{"error":"no resource name k8s-gci-gke-slow-1-5","level":"warning","msg":"Update k8s-gci-gke-slow-1-5 failed","time":"2019-10-10T16:34:59Z"}

@k8s-ci-robot k8s-ci-robot added the kind/oncall-hotlist Categorizes issue or PR as tracked by test-infra oncall. label Oct 10, 2019
@fejta
Contributor

fejta commented Oct 10, 2019

@fejta
Contributor

fejta commented Oct 10, 2019

What do these errors mean?

{
 insertId: "ld746eg13kr670"  
 jsonPayload: {
  error: "no resource name k8s-jkns-gci-gke-serial-1-4"   
  level: "warning"   
  msg: "Update k8s-jkns-gci-gke-serial-1-4 failed"   
 }
 labels: {…}  
 logName: "projects/k8s-prow-builds/logs/boskos-janitor"  
 receiveTimestamp: "2019-10-10T18:26:27.814170080Z"  
 resource: {
  labels: {
   cluster_name: "prow"    
   container_name: "boskos-janitor"    
   instance_id: "8529070823101903748"    
   namespace_id: "test-pods"    
   pod_id: "boskos-janitor-7b85969d5c-r45pj"    
   project_id: "k8s-prow-builds"    
   zone: "us-central1-f"    
  }
  type: "container"   
 }
 severity: "ERROR"  
 timestamp: "2019-10-10T18:26:24Z"  
}

@fejta
Contributor

fejta commented Oct 10, 2019

https://github.com/kubernetes/test-infra/commits/master/boskos last change was on Oct 4, which was part of the 1004 bump: #14598
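
For reference, the equivalent check from a local kubernetes/test-infra checkout is roughly:

$ git log --oneline --since=2019-10-01 -- boskos/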

@fejta
Contributor

fejta commented Oct 10, 2019

#14598 (comment) merged at Oct 4, 11:14a PST

@fejta
Contributor

fejta commented Oct 10, 2019

@BenTheElder
Member

http://velodrome.k8s.io/dashboard/db/boskos-dashboard?orgId=1&from=now-2d&to=now
The non-GKE projects look happier now, but GKE dirty still looks to be climbing. Do we need to do #14708 for GKE as well?

@BenTheElder
Member

Per @krzyzacy, it seems cleaning up the new resource types (routers) is a lot slower. Bumping the janitor replicas seems to have worked (ref #14708 by @fejta); I've sent a bump for the other GCP project (GKE project) janitor in #14714.
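
For anyone hitting this later, the manual equivalent of that mitigation is roughly the following; the deployment name and replica count here are assumptions, and the actual bump in #14708/#14714 was made in the checked-in config rather than by hand:

$ kubectl --context=prow-builds -n test-pods scale deployment/boskos-janitor --replicas=10
$ kubectl --context=prow-builds -n test-pods get pods -l app=boskos-janitor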

@BenTheElder
Member

BenTheElder commented Oct 10, 2019

GKE dirty projects have dramatically dropped from 145 to 3. I think we're good for now, but we'll probably still want #14715.

@fejta fejta closed this as completed Oct 10, 2019
@fejta
Contributor

fejta commented Oct 10, 2019

More work to do, but as Ben says, the outage is resolved.

mm4tt added a commit to mm4tt/test-infra that referenced this issue Oct 11, 2019
Boskos issues were fixed in
kubernetes#14697.
Let's try re-enabling.

Ref. kubernetes/perf-tests#650
@k8s-ci-robot
Contributor

@mm4tt: Closing this issue.

In response to this:

Lost track of this one; we added more projects to the presubmit pool a long time ago. Both our pools seem healthy as of now.
[screenshot]

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@BenTheElder
Member

BenTheElder commented Feb 18, 2020

https://monitoring.prow.k8s.io/d/wSrfvNxWz/boskos-resource-usage?orgId=1

[screenshot]

this does not look healthy to me.

@BenTheElder BenTheElder reopened this Feb 18, 2020
@BenTheElder BenTheElder removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 18, 2020
@fejta
Contributor

fejta commented Feb 18, 2020

Some potential commits

commit d27cc13
Merge: 2c6e37d 7c0ac78
Author: Kubernetes Prow Robot <k8s-ci-robot@users.noreply.github.com>
Date: Tue Feb 18 03:03:29 2020 -0800

Merge pull request #16335 from mm4tt/gce-100-containerd

Add experimental gce-100-perf job with containerd enabled

commit 7c0ac78
Author: Matt Matejczyk <mmat@google.com>
Date: Tue Feb 18 11:20:14 2020 +0100

Add experimental gce-100-perf job with containerd

commit 08bbe16
Author: Jakub Przychodzeń <jprzychodzen@google.com>
Date: Tue Feb 18 11:12:46 2020 +0100

[Access-tokens] Ramp up access-tokens test

commit 73cc13b
Merge: 51ad7a6 dfcf87c
Author: Kubernetes Prow Robot <k8s-ci-robot@users.noreply.github.com>
Date: Mon Feb 17 03:45:28 2020 -0800

Merge pull request #16248 from jkaniuk/perf-tests-presubmits

Add presubmits for perf-tests release branches

commit 51ad7a6
Merge: 8654545 5e3b4be
Author: Kubernetes Prow Robot <k8s-ci-robot@users.noreply.github.com>
Date: Mon Feb 17 01:03:29 2020 -0800

Merge pull request #16297 from jprzychodzen/at-roll-2

[Access-tokens] Rollout access-tokens test to seconds set of jobs

@mm4tt
Contributor Author

mm4tt commented Feb 19, 2020

We have two pools, one for continuous jobs and another for presubmits.

[screenshot]

The continuous one seems pretty healthy, but the presubmit one looks even worse now.
Only one of the commits you listed touched presubmit jobs; I'll take a look.

In the meantime, @jprzychodzen will add more projects to the presubmit pool to stop the bleeding.

@mm4tt
Contributor Author

mm4tt commented Feb 19, 2020

Something seems off: we have 40 projects in the pool, and as of now almost 100% of them seem to be busy.
But when I tried counting all running presubmit jobs, I couldn't find more than 15:

https://prow.k8s.io/?type=presubmit&job=pull-kubernetes*performance*&state=pending
[screenshot]
https://prow.k8s.io/?type=presubmit&job=pull-kubernetes*kubemark*&state=pending
[screenshot]
https://prow.k8s.io/?type=presubmit&job=*clusterloader2*&state=pending
[screenshot]

Is it possible that boskos is somehow leaking the projects? Do we have any logs/metrics from boskos for our presubmit pool?
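
One way to cross-check prow's view against boskos' is to count pending ProwJobs directly; the kubectl context below and the assumption that the ProwJob CRD is reachable from it are guesses about this deployment:

$ kubectl --context=prow-builds get prowjobs \
    -o jsonpath='{range .items[?(@.status.state=="pending")]}{.spec.job}{"\n"}{end}' \
    | grep -cE 'performance|kubemark|clusterloader2'

If that count stays well below the number of busy projects boskos reports, the projects are probably leaking or stuck in cleaning rather than genuinely in use.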

@mm4tt
Contributor Author

mm4tt commented Feb 19, 2020

/cc @mborsz - scalability oncall
@fejta (test-infra) oncall is already assigned

@mborsz
Member

mborsz commented Feb 19, 2020

This can be caused by #16338

@mm4tt
Contributor Author

mm4tt commented Feb 19, 2020

The perf-tests release-branch presubmits were added two days ago in #16248.
I don't see anything wrong with them, but it looks like they are causing the problems, so maybe we should consider reverting that PR to stop the bleeding.

@jprzychodzen
Contributor

Adding new projects in #16358

@mm4tt
Contributor Author

mm4tt commented Feb 19, 2020

To summarize the findings from #16338: our understanding is that we have presubmit jobs for which prow is spawning pods continuously, and those pods in turn eat up all the projects in the presubmit boskos pool. We'll try disabling these presubmits as a mitigation to unblock the k/k presubmits (currently all k/k PRs are blocked by this).

@mm4tt
Contributor Author

mm4tt commented Feb 20, 2020

Looks like our presubmit pool has recovered:

[screenshot]

One more thing before we close this bug: we added 5 new projects to the pool in #16358, but most likely some boskos config push or reload needs to happen before they are actually picked up.

@fejta, is this something you could help us with?
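
If a manual refresh turns out to be needed, something along these lines should do it; the configmap name, key, and file path are assumptions about how this deployment mounts the boskos config, and boskos reloading the file without a restart is also an assumption:

$ kubectl --context=prow-builds -n test-pods create configmap resources \
    --from-file=config=boskos/resources.yaml --dry-run=client -o yaml \
    | kubectl --context=prow-builds -n test-pods apply -f -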

@ixdy
Member

ixdy commented Feb 20, 2020

@mm4tt there's a job that automatically updates the boskos configmap, but it looks like it's misconfigured:

Another change in boskos/ just merged, so the config should be updated momentarily, but I'll fix the trigger for that job, too.

@ixdy
Member

ixdy commented Feb 20, 2020

fix for the config updater job: #16395

@mm4tt
Contributor Author

mm4tt commented Feb 21, 2020

Thanks @ixdy for taking a look and fixing the updater job.
I can see that our presubmit pool has been updated with the 5 extra projects.
We can close this one.

/close
