A more sustainable approach to owning and maintaining test/release infrastructure #737

Closed
jlewi opened this issue Jul 17, 2020 · 17 comments

Comments

@jlewi
Contributor

jlewi commented Jul 17, 2020

We need a more sustainable and scalable approach to owning and maintaining engprod (test & release) infrastructure for Kubeflow.

We currently lack a sufficient number of individuals (3-5?) with the willingness and ability (e.g. time) to meet the increasing demands of Kubeflow as it grows.

I think there are a couple of problems we need to address:

  • How do we drive larger engprod projects that would make our infrastructure more scalable?
    • e.g. Using GitOps to make it easier for anyone to submit changes to the infra
    • Automation of publishing release artifacts
  • How do we staff a build cop rotation so that developers aren't blocked for long periods when there are operational problems?

I don't think we can address this simply by increasing the pool of "20%" contributors to engprod.

  • It's hard to be a good leader in your 20% time

    • We need folks who can identify problems, propose solutions, and drive those solutions to completion
    • This is tough to do if you're just working on a project 20% of the time.
  • For security reasons, we can't grant an increasing number of individuals the elevated permissions needed to maintain the test/release infrastructure

  • Engprod operational issues are often P0 because they block everyone from getting work done; we can't expect 20% contributors to drop everything in order to respond to P0s.

I can think of two approaches to this:

  1. We try to staff a horizontal team supporting engprod across Kubeflow
  2. We can make each WG responsible for its own test/release infrastructure

@cliveseldon @yuzisun @ellistarn @neuromage @paveldournov @elikatsis @vpavlin @yanniszark @Jeffwan @krishnadurai @terrytangyuan @gaocegege @andreyvelich @johnugeorge @aronchick @StefanoFioravanzo @elviraux @kimwnasptd @krazyhaas @jinchihe @animeshsingh

@issue-label-bot

Issue-Label Bot is automatically applying the labels:

Label Probability
kind/feature 0.72
area/engprod 0.76

Please mark this comment with 👍 or 👎 to give our bot feedback!

@8bitmp3

8bitmp3 commented Aug 12, 2020

+1. Bumping this discussion @jlewi

@swiftdiaries
Member

I have a suggestion on this. I think there are three levels of testing in Kubeflow.

  • Native tests (unit and integration tests that run natively on a machine)
  • Application tests (KFServing, Jupyter-Web-App etc...)
  • E2E tests (platform level tests)

The Native tests and Application tests probably need to be nimble and fast. I was thinking we could use GitHub Actions for this. This would free up testing resources and spare application owners from delving deep into the Prow/k8s test infra space (because that is a lot of cognitive load).

The E2E tests can be owned by the platform owners, with each of them bringing their own environment and staffing it themselves. So currently, we have:

  • GCP
  • IBM
  • AWS
  • vanilla k8s (or on-premises)

@Jeffwan probably has the right idea with #748. Istio has a similar setup, with IBM having a separate testing environment and so on. This might solve issues around permissions and also free application owners to think about testing their applications independently.

What do people think?

/cc @jlewi @animeshsingh @Jeffwan @8bitmp3

@jlewi
Contributor Author

jlewi commented Aug 18, 2020

@swiftdiaries How are GitHub Actions administered? Would individual projects be able to administer GitHub Actions for themselves, or would there be a maintenance cost for GitHub org admins?

@swiftdiaries
Member

swiftdiaries commented Aug 18, 2020

They can manage it themselves at the repo level, using an OWNERS file within the repo.

@yanniszark
Contributor

@swiftdiaries I agree that GitHub Actions are a great asset and much easier to figure out and get started with than the current Prow setup. It would be a great option to enable application developers to quickly develop testing pipelines for their code.
Plus, developers can benefit from a multitude of community-developed actions.

@Jeffwan
Member

Jeffwan commented Aug 20, 2020

I am not sure if all application owners want to set up testing pipelines by themselves. I assume some engineers just want to write test cases and roll them out to a cluster which can quickly pick the tests up. We probably need to cover this case as well.

@thesuperzapper
Member

+1 on GitHub Actions for everything but the end-to-end testing/CI/CD.

@jlewi
Contributor Author

jlewi commented Aug 21, 2020

One downside of GitHub Actions is that they are pretty closely tied to GitHub. One of the motivations for using Tekton and Kubernetes-native test infra was to make it easy for people to replicate and run the tests on their own infra.

That said, I think the decision should be up to the individual WGs what they want to prescribe and maintain for their respective projects.

From that perspective, the question should be: can the WG leads scalably administer GitHub Actions on behalf of their projects? For example, can they onboard new projects/repos without creating toil for the Kubeflow GitHub org admins?

Another issue would be billing and quota. How do we ensure fair scheduling between WGs? If a WG needs to exceed the free tier, are the WG leads in a position to assume those costs?

@yanniszark
Contributor

yanniszark commented Aug 21, 2020

> One downside of GitHub Actions is that they are pretty closely tied to GitHub. One of the motivations for using Tekton and Kubernetes-native test infra was to make it easy for people to replicate and run the tests on their own infra.

GitHub has released the runner code, so it should be possible for users to run these pipelines on their own.
Plus, there are open source projects that even let you run them locally:
https://github.com/actions/runner
https://github.com/nektos/act

> Another issue would be billing and quota. How do we ensure fair scheduling between WGs? If a WG needs to exceed the free tier, are the WG leads in a position to assume those costs?

The free tier is unlimited for public repos so I don't think that's an issue.
The only real restriction I see is the max of 20 concurrent jobs (which can be circumvented with self-hosted runners).
Plus, if the tests are lightweight (max 10min) maybe that won't be an issue for some time. It would help if we could have data for the rate of PR tests triggering.
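To put rough numbers on that, here is a minimal back-of-the-envelope sketch using Little's law (average jobs in flight = arrival rate × job duration). The trigger rate and jobs-per-trigger values are made-up placeholders, not measured Kubeflow data, and the function names are only illustrative; the 20-job limit and the 10-minute duration are the figures mentioned above.

```python
# Rough capacity estimate via Little's law:
#   average jobs running at once = job arrival rate * time each job runs.
# All inputs in the example call are hypothetical placeholders, not real data.

CONCURRENT_JOB_LIMIT = 20  # concurrency limit quoted in this comment


def average_concurrent_jobs(triggers_per_hour: float,
                            jobs_per_trigger: int,
                            job_minutes: float) -> float:
    """Average number of jobs in flight, assuming a steady trigger rate."""
    jobs_per_minute = triggers_per_hour * jobs_per_trigger / 60.0
    return jobs_per_minute * job_minutes


def max_triggers_per_hour(jobs_per_trigger: int,
                          job_minutes: float,
                          limit: int = CONCURRENT_JOB_LIMIT) -> float:
    """Trigger rate at which the average load reaches the concurrency limit."""
    return limit * 60.0 / (jobs_per_trigger * job_minutes)


if __name__ == "__main__":
    # Assumed workload: 12 PR test triggers per hour, 3 jobs each, 10 min per job.
    load = average_concurrent_jobs(triggers_per_hour=12, jobs_per_trigger=3, job_minutes=10)
    print(f"average concurrent jobs: {load:.1f} (limit {CONCURRENT_JOB_LIMIT})")
    print(f"trigger rate that saturates the limit: {max_triggers_per_hour(3, 10):.0f}/hour")
```

This only models the average load; bursts of PR activity would hit the limit sooner, which is why real trigger-rate data would still be needed.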

@jlewi I think the main problem with using our own Tekton + Kubernetes infra is this:

> We currently lack a sufficient number of individuals (3-5?) with the willingness and ability (e.g. time) to meet the increasing demands of Kubeflow as it grows.

Instead of having to find these individuals, train them in Kubeflow's complicated infra, and make sure they are on call for issues (e.g., quotas filling up), we can use GitHub Actions, which, being a managed service, circumvents this restriction. Plus, we can always use self-hosted runners, which circumvent usage limits, if we bump into scaling issues.

> That said, I think the decision should be up to the individual WGs what they want to prescribe and maintain for their respective projects.

Totally agree!

@jlewi
Contributor Author

jlewi commented Aug 26, 2020

Related buildcop issue: #658

@jlewi
Contributor Author

jlewi commented Aug 26, 2020

@kubeflow/automl-leads @kubeflow/kfserving-owners @kubeflow/training-leads thoughts?

@gaocegege
Member

cc @kubeflow/wg-automl-leads @kubeflow/wg-training-leads

> I am not sure if all application owners want to set up testing pipelines by themselves. I assume some engineers just want to write test cases and roll them out to a cluster which can quickly pick the tests up. We probably need to cover this case as well.

Same idea as @Jeffwan.

@jlewi
Contributor Author

jlewi commented Sep 1, 2020

Related issue to move Kubeflow to the Google instance of Prow and stop using the Kubernetes instance:
kubernetes/test-infra#14343

@Jeffwan
Member

Jeffwan commented Sep 22, 2020

I added a doc for community members to review. It provides an alternative option for running e2e tests on AWS:

http://bit.ly/kubeflow-test-infra-aws

@stale

stale bot commented Dec 24, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in one week if no further activity occurs. Thank you for your contributions.

@stale

stale bot commented Dec 31, 2020

This issue has been closed due to inactivity.

7 participants