Boskos seems to be wedged #186

bobcatfish · 2020-01-18T02:57:48Z

Expected Behavior

Boskos should clean up projects once they are done being used and make them available for future use.
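To make the expected lifecycle concrete, here is a toy in-memory sketch of it. It is not the Boskos API: the `Pool` class, its method names, and the project names are all hypothetical, and it only assumes the three states that show up in the boskos logs (free, busy, dirty). It illustrates why a janitor that keeps failing would eventually leave nothing to acquire.

```python
class Pool:
    """Toy stand-in for a Boskos resource pool (hypothetical, not the real API)."""

    def __init__(self, names):
        # Every project starts out available.
        self.state = {name: "free" for name in names}

    def acquire(self):
        # Hand out the first free project and mark it busy.
        for name, state in self.state.items():
            if state == "free":
                self.state[name] = "busy"
                return name
        raise LookupError("resources not found")

    def release(self, name):
        # A finished job returns its project as dirty, not free.
        self.state[name] = "dirty"

    def janitor_pass(self, cleanup):
        # Only a successful cleanup moves a project from dirty back to free.
        for name, state in self.state.items():
            if state == "dirty" and cleanup(name):
                self.state[name] = "free"


pool = Pool(["project-a", "project-b"])
for name in [pool.acquire(), pool.acquire()]:
    pool.release(name)

pool.janitor_pass(lambda name: False)  # cleanup fails for every project
try:
    pool.acquire()
except LookupError as err:
    print("acquire failed:", err)

pool.janitor_pass(lambda name: True)  # cleanup succeeds again
print("acquired:", pool.acquire())
```

In this model, provisioning more projects only delays the failure: each new project also ends up dirty after one use unless cleanup starts succeeding again.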

Actual Behavior

tektoncd/pipeline#1541 and tektoncd/pipeline#1888 both have consistently failing integration tests with an error like:

I0117 21:16:30.477] 2020/01/17 21:16:30 main.go:734: provider gke, will acquire project type gke-project from boskos
I0117 21:21:30.475] 2020/01/17 21:21:30 main.go:316: Something went wrong: failed to prepare test environment: --provider=gke boskos failed to acquire project: resources not found

In #29 and other times in the past we have responded to this error by provisioning more projects for boskos.

This time though it's definitely not the case that all the projects are in use:

https://pantheon.corp.google.com/home/activity?project=tekton-prow-9 <-- had current activity
https://pantheon.corp.google.com/home/activity?project=tekton-prow-10 <-- hasn't had any activity since the 14th

When I look at the logs from the boskos Janitor I see this kind of error:

 msg: "failed to clean up project tekton-prow-10, error info: Activated service account credentials for: [prow-account@tekton-releases.iam.gserviceaccount.com]
ERROR: (gcloud.compute.instances.list) Some requests did not succeed:
 - Invalid value for field 'zone': 'asia-northeast3-a'. Unknown zone.
 - Invalid value for field 'zone': 'asia-northeast3-b'. Unknown zone.
 - Invalid value for field 'zone': 'asia-northeast3-c'. Unknown zone.

Fail to list resource 'instances' from project 'tekton-prow-10'
ERROR: (gcloud.compute.disks.delete) unrecognized arguments: --global 

To search the help text of gcloud commands, run:
  gcloud help -- SEARCH_TERMS
Error try to delete resources: CalledProcessError()
ERROR: (gcloud.container.clusters.list) ResponseError: code=404, message=Not Found.
[=== Start Janitor on project 'tekton-prow-10' ===]
[=== Activating service_account /etc/test-account/service-account.json ===]
[=== Finish Janitor on project 'tekton-prow-10' with status 1 ===]

I think the gcloud error might be a red herring, maybe a state that boskos gets into after some other kind of error first.
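Since the janitor output mixes several different gcloud failures, it can help to pull them apart before deciding which one matters. A minimal sketch, assuming the output is available as plain text; the `gcloud_errors` helper and the regular expression are illustrative, and the sample is an abbreviated copy of the log above:

```python
import re

# Abbreviated excerpt of the janitor output quoted above.
JANITOR_OUTPUT = """\
ERROR: (gcloud.compute.instances.list) Some requests did not succeed:
 - Invalid value for field 'zone': 'asia-northeast3-a'. Unknown zone.
ERROR: (gcloud.compute.disks.delete) unrecognized arguments: --global
Error try to delete resources: CalledProcessError()
ERROR: (gcloud.container.clusters.list) ResponseError: code=404, message=Not Found.
[=== Finish Janitor on project 'tekton-prow-10' with status 1 ===]
"""

# Matches lines such as "ERROR: (gcloud.compute.disks.delete) <message>".
GCLOUD_ERROR = re.compile(r"^ERROR: \((gcloud\.[\w.\-]+)\)\s*(.*)$")


def gcloud_errors(output):
    """Return (command, message) pairs for every gcloud ERROR line."""
    errors = []
    for line in output.splitlines():
        match = GCLOUD_ERROR.match(line)
        if match:
            errors.append((match.group(1), match.group(2).strip()))
    return errors


for command, message in gcloud_errors(JANITOR_OUTPUT):
    print(f"{command}: {message}")
```

Grouping by command shows three separate failures here (unknown zones on `instances.list`, an unrecognized `--global` flag on `disks.delete`, and a 404 on `clusters.list`), which may or may not share a root cause.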

CPU and memory usage for both boskos + the boskos janitor started going up a few hours ago but it's hard to say if that is causing the problem or if the problem is causing it:

Also this particular janitor pod has been steadily using more and more memory (interestingly this one was started on Jan 6 but the other two janitor pods had been around since May)

The other 2 janitor pods look like:

Additional Info

kubernetes/test-infra#15866 (boskos/janitor: track when cleanup fails repeatedly for the same resource) is related: it is about detecting when the janitor fails repeatedly
kubernetes/test-infra#14611 (Router quota exceeded error causing GCE tests to fail) hints that this might be a quota problem. We're definitely near our CPU limit, so I've requested an increase

I couldn't find any other quotas that seemed like they needed increasing. I think there's a good chance that boskos got into a bad state and just restarting everything will fix it.

Coincidentally, there was a (seemingly unrelated?) GCP outage at the time these errors started, so maybe that put things into a bad state: https://status.cloud.google.com/incident/zall/20001

It's also possible that this is because we're using such an old version of boskos and it might need an update. I think there's a good chance that updating boskos will solve the whole thing, but I didn't want to rush to do that since we might run into other problems.


bobcatfish · 2020-01-18T03:21:24Z

Can see these failures happening consistently across PRs:

Trying to address tektoncd#186 but it does nothing

bobcatfish · 2020-01-18T03:24:27Z

I tried some things but nothing has worked:

I deleted all the boskos + boskos janitor pods so they had to be re-created (didn't try the reaper though)
Tried manually adding an additional project in case that helped (Flailing around trying to get boskos to work #187) but no dice

bobcatfish · 2020-01-18T03:26:51Z

I think the next thing to try is to update the boskos images to something newer: maybe the reason that outage + this started at the same time was that there was a rollout of something that is no longer compatible with our ancient boskos images (and their gcloud install?)

afrittoli · 2020-01-18T07:23:03Z

Thank you for the detailed analysis!
I might be able to try and update boskos in the Prow cluster later today, unless someone else can do that before.

afrittoli · 2020-01-18T11:02:33Z

The issue still seems to be there, since I can still see a lot of failures in https://tekton-releases.appspot.com/builds/tekton-prow/pr-logs/directory/pull-tekton-pipeline-integration-tests.
I tried a /retest on one PR, though, and it was able to get a cluster from Boskos.
Checking boskos logs, this shows up continuously:

{\"type\":\"gke-project\",\"name\":\"tekton-prow-13\",\"state\":\"free\",\"owner\":\"\",\"lastupdate\":\"2020-01-18T09:37:25.864824733Z\",\"userdata\":{}},
{\"type\":\"gke-project\",\"name\":\"tekton-prow-5\",\"state\":\"free\",\"owner\":\"\",\"lastupdate\":\"2020-01-18T09:37:26.78376571Z\",\"userdata\":{}},
{\"type\":\"gke-project\",\"name\":\"tekton-prow-12\",\"state\":\"free\",\"owner\":\"\",\"lastupdate\":\"2020-01-18T09:37:28.135215444Z\",\"userdata\":{}},
{\"type\":\"gke-project\",\"name\":\"tekton-prow-7\",\"state\":\"free\",\"owner\":\"\",\"lastupdate\":\"2020-01-18T09:37:28.455754565Z\",\"userdata\":{}},
{\"type\":\"gke-project\",\"name\":\"tekton-prow-11\",\"state\":\"free\",\"owner\":\"\",\"lastupdate\":\"2020-01-18T09:37:29.393499993Z\",\"userdata\":{}},
{\"type\":\"gke-project\",\"name\":\"tekton-prow-1\",\"state\":\"free\",\"owner\":\"\",\"lastupdate\":\"2020-01-18T09:37:31.945974382Z\",\"userdata\":{}},
{\"type\":\"gke-project\",\"name\":\"tekton-prow-3\",\"state\":\"free\",\"owner\":\"\",\"lastupdate\":\"2020-01-18T09:37:35.84089279Z\",\"userdata\":{}},
{\"type\":\"gke-project\",\"name\":\"tekton-prow-4\",\"state\":\"free\",\"owner\":\"\",\"lastupdate\":\"2020-01-18T09:37:35.926420242Z\",\"userdata\":{}},
{\"type\":\"gke-project\",\"name\":\"tekton-prow-2\",\"state\":\"free\",\"owner\":\"\",\"lastupdate\":\"2020-01-18T09:37:38.518577986Z\",\"userdata\":{}},
{\"type\":\"gke-project\",\"name\":\"tekton-prow-9\",\"state\":\"free\",\"owner\":\"\",\"lastupdate\":\"2020-01-18T09:37:38.586946232Z\",\"userdata\":{}},
{\"type\":\"gke-project\",\"name\":\"tekton-prow-6\",\"state\":\"busy\",\"owner\":\"pull-tekton-pipeline-integration-tests\",\"lastupdate\":\"2020-01-18T10:42:54.252574171Z\",\"userdata\":{}},
{\"type\":\"gke-project\",\"name\":\"tekton-prow-14\",\"state\":\"dirty\",\"owner\":\"\",\"lastupdate\":\"2020-01-18T10:43:23.243993663Z\",\"userdata\":{}},
{\"type\":\"gke-project\",\"name\":\"tekton-prow-10\",\"state\":\"dirty\",\"owner\":\"\",\"lastupdate\":\"2020-01-18T10:43:27.643677059Z\",\"userdata\":{}},
{\"type\":\"gke-project\",\"name\":\"tekton-prow-0\",\"state\":\"dirty\",\"owner\":\"\",\"lastupdate\":\"2020-01-18T10:43:30.207256322Z\",\"userdata\":{}}]"

The busy project is the one used for my retest, but there are three projects that are dirty with no owner, that boskos keeps trying to reset, but with no luck.
I tried deleting the boskos-reaper pod, which was 254d old, but it doesn't seem to help.
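For anyone triaging a similar dump, a short script can pick out the stuck projects. This is only a sketch: it assumes the resource list has been copied out of the log with the backslash escaping removed, it includes just five of the entries above, and the `summarize` helper is illustrative rather than part of Boskos.

```python
import json
from collections import Counter

# Five of the entries quoted above, with the log escaping removed.
RESOURCES_JSON = """[
 {"type": "gke-project", "name": "tekton-prow-9", "state": "free", "owner": "",
  "lastupdate": "2020-01-18T09:37:38.586946232Z", "userdata": {}},
 {"type": "gke-project", "name": "tekton-prow-6", "state": "busy",
  "owner": "pull-tekton-pipeline-integration-tests",
  "lastupdate": "2020-01-18T10:42:54.252574171Z", "userdata": {}},
 {"type": "gke-project", "name": "tekton-prow-14", "state": "dirty", "owner": "",
  "lastupdate": "2020-01-18T10:43:23.243993663Z", "userdata": {}},
 {"type": "gke-project", "name": "tekton-prow-10", "state": "dirty", "owner": "",
  "lastupdate": "2020-01-18T10:43:27.643677059Z", "userdata": {}},
 {"type": "gke-project", "name": "tekton-prow-0", "state": "dirty", "owner": "",
  "lastupdate": "2020-01-18T10:43:30.207256322Z", "userdata": {}}
]"""


def summarize(resources):
    """Count resources per state and list dirty resources that have no owner."""
    by_state = Counter(resource["state"] for resource in resources)
    stuck = sorted(
        resource["name"]
        for resource in resources
        if resource["state"] == "dirty" and not resource["owner"]
    )
    return by_state, stuck


by_state, stuck = summarize(json.loads(RESOURCES_JSON))
print(dict(by_state))
print("dirty with no owner:", stuck)
```

Comparing the `lastupdate` values across two dumps taken a few minutes apart would additionally show whether the dirty entries are being retried or have stopped changing.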

For tekton-prow-14 specifically, something seems to be wrong with the setup:

ERROR: (gcloud.compute.sole-tenancy.node-templates.list) HTTPError 403: Access Not Configured. Compute Engine API has not been used in project [censored] before or it is disabled. Enable it by visiting https://console.developers.google.com/apis/api/compute.googleapis.com/overview?project=[censored] then retry. If you enabled this API recently, wait a few minutes for the action to propagate to our systems and retry.\nFail to list resource 'sole-tenancy' from project 'tekton-prow-14'

vdemeester · 2020-01-20T10:16:54Z

/area boskos
/kind bug
/area test-infra

afrittoli · 2020-01-20T11:08:16Z

I tried updating boskos to the latest image available, v20190621-ff01381, but the error in the janitor logs persists:

{"error":"exit status 1","level":"error","msg":"failed to clean up project tekton-prow-10, error info: Activated service account credentials for: [prow-account@tekton-releases.iam.gserviceaccount.com]\nERROR: (gcloud.compute.disks.delete) unrecognized arguments: --global \n\nTo search the help text of gcloud commands, run:\n  gcloud help -- SEARCH_TERMS\nError try to delete resources disks: CalledProcessError()\nERROR: (gcloud.container.clusters.list) ResponseError: code=404, message=Not Found.\n[=== Start Janitor on project 'tekton-prow-10' ===]\n[=== Activating service_account /etc/test-account/service-account.json ===]\n[=== Finish Janitor on project 'tekton-prow-10' with status 1 ===]\n","time":"2020-01-20T11:02:46Z"}
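The record is easier to read once the JSON is decoded and the escaped newlines in `msg` are expanded. A minimal sketch using only the standard library; the sample is an abbreviated copy of the record above, with `[...]` marking text cut for brevity:

```python
import json

# Abbreviated copy of the janitor log record quoted above; "[...]" marks
# text that was cut for brevity and is not part of the original record.
RAW_RECORD = r'''{"error":"exit status 1","level":"error","msg":"failed to clean up project tekton-prow-10, error info: [...]\nERROR: (gcloud.compute.disks.delete) unrecognized arguments: --global \n[...]\n[=== Finish Janitor on project 'tekton-prow-10' with status 1 ===]\n","time":"2020-01-20T11:02:46Z"}'''

record = json.loads(RAW_RECORD)

# Split the message on its embedded newlines and drop empty lines.
lines = [line.rstrip() for line in record["msg"].splitlines() if line.strip()]

print(record["time"], record["level"], record["error"])
for line in lines:
    print("  " + line)
```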

afrittoli · 2020-01-20T11:15:10Z

Looking at tekton-prow-10 via the console, there is no k8s cluster in the project, but there are two PVC backing disks left around:

I don't have permissions to delete them manually - doing so might unblock the project until we sort out the issue on boskos side.

Update the container image used by Boskos components in an attempt to solve tektoncd#186.


bobcatfish · 2020-01-21T15:14:19Z

I've deleted the disks from projects 10 and 0!

bobcatfish · 2020-01-21T19:24:34Z

Looks like lots of folks using boskos ran into problems on Friday: kubernetes/test-infra#15951

bobcatfish · 2020-01-21T23:08:10Z

Okay so I looked into it a bit more. When I looked today, after deleting the disks that @afrittoli noticed had been left behind, all of the clusters were "free". (TODO for myself: give everyone more permissions!! Or we can move to a model where maybe the CDF owns these clusters and everyone can be a full admin!!! ANYWAY)

I (finally) noticed that the Prow folks ran into these same errors on Friday resulting in kubernetes/test-infra#15951. It looks like the consensus is that gcloud itself had an outage (i.e. the API it communicates with).

I also updated to the latest boskos images that the Prow folks are using (and noticed #193) but we should be good to go now!

Using the same boskos version the prow folks are using, e.g., https://github.com/kubernetes/test-infra/blob/b2471685eed6a7d063d7e1e19032282bb33679db/prow/cluster/boskos.yaml#L65 which they bumped to in the context of dealing with the same issue we ran into on Friday (tektoncd#186)


bobcatfish mentioned this issue Jan 18, 2020

fix panic in github interceptor tektoncd/triggers#357

Merged

3 tasks

bobcatfish added a commit to bobcatfish/plumbing that referenced this issue Jan 18, 2020

Flailing around trying to get boskos to work

45e2d2c

Trying to address tektoncd#186 but it does nothing

bobcatfish mentioned this issue Jan 18, 2020

Flailing around trying to get boskos to work #187

Closed

tekton-robot added the area/boskos, kind/bug, and area/test-infra labels Jan 20, 2020

afrittoli added a commit to afrittoli/plumbing that referenced this issue Jan 20, 2020

Update the boskos image

1ea87d3

Update the container image used by Boskos components in an attempt to solve tektoncd#186.

afrittoli mentioned this issue Jan 20, 2020

Update the boskos image #188

Merged

tekton-robot pushed a commit that referenced this issue Jan 20, 2020

Update the boskos image

9fa6a75

Update the container image used by Boskos components in an attempt to solve #186.

bobcatfish closed this as completed Jan 21, 2020

bobcatfish mentioned this issue Jan 21, 2020

Use the latest boskos ✨ #194

Merged

1 task

lukasschwab mentioned this issue Jan 22, 2020

Invalid value for field 'region': 'asia-northeast3'. Unknown region." googleapis/nodejs-compute#390

Closed

dibyom mentioned this issue Aug 19, 2020

E2E tests failing on Boskos #534

Closed
