Boskos seems to be wedged #186
I can see these failures happening consistently across PRs.
Trying to address tektoncd#186 but it does nothing
I tried a few things, but nothing has worked so far.
I think the next thing to try is to update the boskos images to something newer: maybe the reason the outage and this issue started at the same time is that there was a rollout of something that is no longer compatible with our ancient boskos images (and the gcloud install they bundle?).
Thank you for the detailed analysis!
The issue still seems to be there, since I can still see a lot of failures in https://tekton-releases.appspot.com/builds/tekton-prow/pr-logs/directory/pull-tekton-pipeline-integration-tests.
The busy project is the one used for my retest, but there are three projects that are dirty with no owner, which boskos keeps trying to reset with no luck. For project14 specifically, something seems to be wrong with the setup.
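A quick way to see what might be blocking the reset is to list what is still left in the stuck project. This is only a minimal sketch, assuming gcloud is authenticated with access to these projects and using `project14` as written above (the real GCP project ID may differ):

```python
# Sketch: list resources that might be blocking the janitor in a leased project.
# Assumes gcloud is installed and authenticated; the project ID is a placeholder.
import subprocess

PROJECT = "project14"  # as named above; the real GCP project ID may differ

for resource in ("container clusters", "compute instances", "compute disks"):
    cmd = ["gcloud", *resource.split(), "list", "--project", PROJECT]
    print(f"$ {' '.join(cmd)}")
    subprocess.run(cmd, check=False)  # an empty listing is fine, not an error
```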
/area boskos
I tried updating boskos to the latest image available, v20190621-ff01381, but the error in the janitor logs persists.
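To double-check that the new tag actually rolled out, something like the following prints the image each component is running. This is only a sketch: the `test-pods` namespace and the deployment names are assumptions based on the usual test-infra layout, not something verified against our config:

```python
# Sketch: print the container image each boskos deployment is currently running.
# Namespace and deployment names are assumptions; adjust for this cluster.
import subprocess

for deploy in ("boskos", "boskos-janitor"):
    out = subprocess.run(
        ["kubectl", "-n", "test-pods", "get", "deployment", deploy,
         "-o", "jsonpath={.spec.template.spec.containers[*].image}"],
        capture_output=True, text=True, check=True,
    )
    print(f"{deploy}: {out.stdout.strip()}")
```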
Update the container image used by Boskos components in an attempt to solve #186.
I've deleted the disks from projects 10 and 0!
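For the record, the manual cleanup amounts to listing and deleting the leftover disks. A hedged sketch of that (the project IDs below are placeholders for the real GCP project IDs, and it assumes gcloud is authenticated with enough permissions on those projects):

```python
# Sketch: delete leftover disks in projects the janitor failed to clean up.
# Project IDs are hypothetical placeholders.
import subprocess

for project in ("project10", "project0"):  # placeholders for the real project IDs
    listing = subprocess.run(
        ["gcloud", "compute", "disks", "list", "--project", project,
         "--format", "value(name,zone)"],
        capture_output=True, text=True, check=True,
    )
    for line in listing.stdout.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip oddly formatted rows (e.g. regional disks)
        name, zone = parts
        zone = zone.rsplit("/", 1)[-1]  # zone may be a full URL; keep the last segment
        subprocess.run(
            ["gcloud", "compute", "disks", "delete", name,
             "--zone", zone, "--project", project, "--quiet"],
            check=True,
        )
```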
Looks like lots of folks using boskos ran into problems on Friday: kubernetes/test-infra#15951
Okay, so I looked into it a bit more. When I looked today, after deleting the disks that @afrittoli noticed were not deleted (TODO for myself: give everyone more permissions!! or we can move to a model where maybe the CDF owns these clusters and everyone can be a full admin!!! ANYWAY), all of the clusters were "free". I (finally) noticed that the Prow folks ran into these same errors on Friday, resulting in kubernetes/test-infra#15951. It looks like the consensus is that gcloud itself had an outage (i.e. the API it communicates with). I also updated to the latest boskos images that the Prow folks are using (and noticed #193), but we should be good to go now!
Using the same boskos version the prow folks are using, e.g., https://github.com/kubernetes/test-infra/blob/b2471685eed6a7d063d7e1e19032282bb33679db/prow/cluster/boskos.yaml#L65 which they bumped to in the context of dealing with the same issue we ran into on Friday (#186)
Expected Behavior
Boskos should clean up projects once they are done being used and make them available for future use.
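For context, the lifecycle behind this expectation is roughly: a test job acquires a free project (free -> busy), runs against it, releases it as dirty, and the janitor wipes it and returns it to free. A rough sketch of the acquire/release side, where the service URL, resource type name, and owner are assumptions rather than our actual config:

```python
# Rough sketch of how a test job leases a project from boskos and hands it back.
# The URL, resource type, and owner name are assumptions, not our actual config.
import requests

BOSKOS = "http://boskos.test-pods.svc.cluster.local"  # assumed in-cluster address
OWNER = "pull-tekton-pipeline-integration-tests"      # hypothetical owner name

# Acquire: move one resource of the given type from "free" to "busy".
resp = requests.post(
    f"{BOSKOS}/acquire",
    params={"type": "gke-project", "state": "free", "dest": "busy", "owner": OWNER},
)
resp.raise_for_status()
project = resp.json()["name"]

# ... run the integration tests against `project` ...

# Release: hand the resource back as "dirty" so the janitor cleans it,
# after which boskos should return it to "free" for the next job.
requests.post(
    f"{BOSKOS}/release",
    params={"name": project, "dest": "dirty", "owner": OWNER},
).raise_for_status()
```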
Actual Behavior
tektoncd/pipeline#1541 and tektoncd/pipeline#1888 both have consistently failing integration tests, all hitting the same error.
In #29 and other times in the past we have responded to this error by provisioning more projects for boskos.
This time, though, it's definitely not the case that all the projects are in use.
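One way to check is boskos's metric endpoint, which reports how many resources of a type are in each state. A sketch with similar caveats as above (the URL and resource type name are assumptions, and the exact JSON shape may vary between boskos versions):

```python
# Sketch: ask boskos how many projects are free/busy/dirty right now.
# URL and resource type are assumptions; the JSON shape may differ by version.
import requests

BOSKOS = "http://boskos.test-pods.svc.cluster.local"  # assumed in-cluster address

resp = requests.get(f"{BOSKOS}/metric", params={"type": "gke-project"})
resp.raise_for_status()
metric = resp.json()
for state, count in sorted(metric.get("current", {}).items()):
    print(f"{state}: {count}")
```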
When I look at the logs from the boskos janitor, I see a recurring gcloud error.
I think the gcloud error might be a red herring: maybe it's a state that boskos gets into after some other kind of error occurs first.
CPU and memory usage for both boskos and the boskos janitor started going up a few hours ago, but it's hard to say if that is causing the problem or if the problem is causing it.
Also, this particular janitor pod has been steadily using more and more memory (interestingly, this one was started on Jan 6, but the other two janitor pods had been around since about May).
The other two janitor pods don't show the same steady growth.
Additional Info
I couldn't find any other quotas that seemed like they needed increasing. I think there's a good chance that boskos got into a bad state and that just restarting everything (sketched below) will fix it.
Coincidentally, there was a (seemingly unrelated?) GCP outage at the time when these errors started: https://status.cloud.google.com/incident/zall/20001. So maybe that put things into a bad state.
It's also possible that this is because we're using such an old version of boskos and it needs an update. I think there's a good chance that updating boskos would solve the whole thing, but I didn't want to rush into that since we might run into other problems.
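If we do end up just restarting everything, a minimal sketch of that is below, assuming the components are plain Deployments in a `test-pods` namespace (kubectl rollout restart needs a reasonably recent kubectl):

```python
# Sketch: bounce the boskos components to clear any wedged in-memory state.
# Namespace and deployment names are assumptions about our setup; adjust the
# list to match whatever is actually deployed (e.g. a reaper, if we run one).
import subprocess

for deploy in ("boskos", "boskos-janitor"):
    subprocess.run(
        ["kubectl", "-n", "test-pods", "rollout", "restart",
         f"deployment/{deploy}"],
        check=True,
    )
```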