RFC: Move boskos testing projects pool to kubernetes.io #390

I'd like to start looking at moving the boskos pool over to public-owned projects. Things I see on the surface we'd need to do:

cc: @kubernetes/test-infra-admins

Comments
/retitle RFC: Move boskos testing projects pool to kubernetes.io |
roughly it's just projects with the CI service account and some admins having access, and quota depending on which pool they are going into
Literally bare. No resources. Just a namespace w/ quota
Some humans should have backup access, but primarily the CI service account needs access. In the future this should be some service account from a publicly owned prow.
Each boskos pool is defined by the kind of quota present. I don't think the GCP non-gke pool is particularly special (and the GKE pool should be managed by GKE...) There are also pools for EG GPU testing, which need quota for that, and I think for scale testing (which of course need more of basically all resources)
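(For context on what "defined by the kind of quota present" looks like in practice: each boskos pool is just a named list of projects of a given resource type, and only the quota inside the projects differs. A minimal sketch, written as a shell heredoc; the project names are hypothetical, not the real entries in boskos-resources.yaml.)

```sh
# Sketch of a boskos resources config: pool membership is a list of project
# names per type; quota is configured on the projects themselves, not here.
cat <<'EOF' > boskos-resources.yaml
resources:
- type: gce-project            # generic pool most GCP e2e jobs draw from
  state: dirty                 # new entries start dirty so the janitor cleans them
  names:
  - k8s-infra-e2e-example-01
  - k8s-infra-e2e-example-02
- type: scalability-project    # same shape; the projects just carry raised quotas
  state: dirty
  names:
  - k8s-infra-e2e-scale-example-01
EOF
```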
We should consider the fact that the state of "is this project available" is in CRDs in the build cluster. |
... continuing (accidentally hit enter): As long as the state is in the build cluster, switching prow over means either serious disruption (we'd need to spin down the pool) or a whole new pool. Humans generally have no need to access these projects, so in terms of getting the community access to the infra, the boskos projects are uninteresting; they're ~100% controlled by automation via public config already. In terms of spending CNCF GCP credits they're somewhat more interesting, I suppose, if that's what we're going for. If we're interested in just getting things migrated because we should migrate things, it will be much more useful to migrate boskos along with the management and state, and generally replace the legacy prow.k8s.io service accounts etc. (can you tell by the jenkins in the name?) ... |
/assign @thockin |
This seems like something we can and should enable ASAP. Christoph started with great questions. I'd like to add a couple:
• How can we break down the billing or attribution for this? With a single big pool and a single CI service account, I have no idea who spent what money on what things. I think we need to do better than this.
• Should we be EOL'ing projects after some number of uses (one? ten?) just for sanity?
• Can quota requests be automated?
• Who owns this, that we can have this conversation?
The net result of this is probably a script which ensures the requisite projects exist and have the correct IAM for the appropriate CI SA, plus a link to docs explaining what they are for. That alone seems straightforward, but without an owner to drive it, I don't think we can reasonably do much. |
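(As a rough, hedged illustration of the kind of "ensure" script described above; the folder, project names, CI service account, and admin group below are assumptions for the sketch, not the actual values.)

```sh
# Idempotently ensure each e2e project exists and grant the CI service account
# access; a human admin group gets read access for break-glass debugging.
set -o errexit -o nounset -o pipefail

CI_SA="prow-build@example-prow-project.iam.gserviceaccount.com"  # hypothetical
ADMIN_GROUP="k8s-infra-e2e-admins@example.org"                   # hypothetical

for project in k8s-infra-e2e-example-01 k8s-infra-e2e-example-02; do
  if ! gcloud projects describe "${project}" >/dev/null 2>&1; then
    gcloud projects create "${project}" --folder="${E2E_FOLDER_ID:?}"
  fi
  gcloud projects add-iam-policy-binding "${project}" \
    --member="serviceAccount:${CI_SA}" --role="roles/editor"
  gcloud projects add-iam-policy-binding "${project}" \
    --member="group:${ADMIN_GROUP}" --role="roles/viewer"
done
```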
> Can we use a service account for each coarse "purpose"?

We can, but the CI users will need to correctly activate their unique service account. These service accounts need to make their way into Prow, and not much prevents someone from using the wrong SA (the Prow cluster is so old and in-place upgraded that it doesn't have RBAC...). Older style "bootstrap.py" prowjobs (don't worry about the details, most of our CI jobs are these though...) automagically activate a default service account before we get to testing.

> Can we use a distinct pool of projects for each coarse purpose?

Yes, we have a few pools today. The GCP projects are monitored here, showing a few types (EG GPU): http://velodrome.k8s.io/dashboard/db/boskos-dashboard?orgId=1
The full set of resources (including AWS) is here: https://github.com/kubernetes/test-infra/blob/d8449cb095fb6dc791958bbaf8940c7c1007410c/prow/cluster/boskos-resources.yaml
The biggest trick is just figuring out what a distinct use is and carving these up... Most tests use the generic GCE pool but they don't have to.

> Should we be EOL'ing projects after some number of uses (one? ten?) just for sanity?

Prow runs O(12000) build/tests a day; if only 25% of those are GCP e2e we'd churn through ~300 projects a day at 10 uses before retiring. I think this probably wouldn't scale.

> Can quota requests be automated?

I took a quick look now and didn't see an API, but I'm not sure.

> Who owns this, that we can have this conversation?

• boskos the tool? => I might have an answer, but waiting for confirmation
• or this migration? ... unsure
• the prow.k8s.io deployment? => the test-infra maintainers / google engprod team nominally at the moment; the infra runs in the build / test workload cluster. |
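(Sketching what "correctly activate their unique service account" would mean inside a job; the service account name, key path, and environment variable are illustrative assumptions.)

```sh
# A job explicitly activates its pool-specific CI service account instead of
# relying on whatever default the image or bootstrap wrapper provides.
gcloud auth activate-service-account \
  "sig-scalability-ci@example-prow-project.iam.gserviceaccount.com" \
  --key-file="/etc/service-account/service-account.json"
gcloud config set project "${BOSKOS_ACQUIRED_PROJECT:?}"  # project leased from boskos
```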
On Wed, Oct 16, 2019 at 7:54 PM Benjamin Elder ***@***.***> wrote:

>> Can we use a service account for each coarse "purpose"?
>
> We can, but the CI users will need to correctly activate their unique service account. These service accounts need to make their way into Prow, and not much prevents someone from using the wrong SA (the Prow cluster is so old and in-place upgraded that it doesn't have RBAC...)

As we consider moving prow into community space, we will HAVE to get a better story around this.

> Older style "bootstrap.py" prowjobs (don't worry about the details, most of our CI jobs are these though...) automagically activate a default service account before we get to testing.
>
>> Can we use a distinct pool of projects for each coarse purpose?
>
> Yes, we have a few pools today. The GCP projects are monitored here, showing a few types (EG GPU): http://velodrome.k8s.io/dashboard/db/boskos-dashboard?orgId=1
> The full set of resources (including AWS) is here: https://github.com/kubernetes/test-infra/blob/d8449cb095fb6dc791958bbaf8940c7c1007410c/prow/cluster/boskos-resources.yaml
> The biggest trick is just figuring out what a distinct use is and carving these up...

Unfortunately a ton of our CI tests are relatively ownerless so this may be tricky. Cull the herd?

> Most tests use the generic GCE pool but they don't *have* to.

I'd like to set the objective that EVERY test identifies which pool it belongs to, and then as needed we can split those pools to better indicate the prime spenders.

>> Should we be EOL'ing projects after some number of uses (one? ten?) just for sanity?
>
> Prow runs O(12000) build/tests a day; if only 25% of those are GCP e2e we'd churn through ~300 projects a day at 10 uses before retiring. I think this probably wouldn't scale.

A project takes 30-45 seconds to create. Not sure how quota would affect this, but I don't see a scale problem (or at least, it doesn't seem WILDLY insane - we could try it :)

>> Can quota requests be automated?
>
> I took a quick look now and didn't see an API, but I'm not sure.

>> Who owns this, that we can have this conversation?
>
> • boskos the tool? => I might have an answer, but waiting for confirmation

Yes

> • or this migration? ... unsure

Yes

> • the prow.k8s.io deployment? => the test-infra maintainers / google engprod team nominally at the moment; the infra runs in the build / test workload cluster.

Less interesting for this thread :)
|
Agreed. I'm certainly not thrilled about the current state... That said, I generally don't think we can consider the presubmit testing to be trustworthy, and scheduling with boskos is cooperative. Changing that would be a bit involved.
Yes and no. A lot of valuable signal shouldn't be culled imo, but still doesn't have a clear owner 😞 (EG who owns the periodic integration and unit testing ...?) We probably need to enforce ownership better somehow. I'm not sure how.
We can do that incrementally with the new community-owned pools we set up; I have no idea what the right granularity would be, though.
... that is a lot faster than I thought. If we can get this to work, that would be a neat trick! 🙃
ACK, I'm hoping for an official "stepping up to the plate" in the next couple of days ... will circle back. @sebastienvas may serve as a transitional owner (he previously worked on this).
ACK ... I can certainly help, I'm also hoping for more help though, perhaps @dims who raised this :-) |
/assign |
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
/remove-lifecycle rotten |
I have done some of this under #752
Permissions for the projects to facilitate community member troubleshooting TBD under #844
I have thus far only worked out what is needed to match the 'scalability' pool (#851); there are others for ingress and GPU, TBD.
How do we do billing per-job or per-sig? |
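(One hedged option for the per-SIG half of that question: label each e2e project with its pool and owning SIG, then group the GCP billing export in BigQuery by those labels. Label keys, values, and the project ID below are assumptions. Per-job attribution is harder, since many jobs share a project between cleanings.)

```sh
# Attach pool/SIG labels to a project so billing-export rows can be grouped by
# labels; the BigQuery billing export carries project labels on each row.
gcloud projects update k8s-infra-e2e-scale-example-01 \
  --update-labels=boskos-pool=scalability-project,sig=scalability
```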
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
I think this is progressing somewhat? |
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
/remove-lifecycle rotten

Yes, this is progressing. Revisiting the description:
• Done. Unfortunately it's not possible to adjust project quota via API. Everything else is scripted today via https://github.com/kubernetes/k8s.io/blob/master/infra/gcp/ensure-e2e-projects.sh
• Done. This was done during development of ensure-e2e-projects.sh. "Bare" (as in no quota adjustments) projects are in the gce-project pool.
• Done, for the scope of this issue.
• Done for all pools except ingress projects and AWS accounts (which I'm excluding from this issue since they're not GCP projects).

There is the question of billing per-job or billing per-sig, in a way that accounts for both cluster usage and project usage. I think we should call that out of scope for this issue. I'm personally ready to close this out. What follow-up work do folks think we should have tracking issues for? |
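(On the quota point above: even though quota can't be set via API, it can be read per project, so it's at least possible to audit that a pool's projects match the intended quota profile. The project ID below is an assumption; a rough sketch.)

```sh
# Dump current quota usage and limits for one project (jq used for filtering).
gcloud compute project-info describe --project=k8s-infra-e2e-scale-example-01 \
  --format=json | jq -r '.quotas[] | "\(.metric): \(.usage)/\(.limit)"'
```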
/close |
@spiffxp: Closing this issue. In response to this:
> /close