A more sustainable approach to owning and maintaining test/release infrastructure #737

jlewi · 2020-07-17T17:01:17Z

We need a more sustainable and scalable approach to owning and maintaining engprod (test & release) infrastructure for Kubeflow.

We currently lack a sufficient number of individuals (3-5?) with the willingness and ability (e.g. time) to meet the increasing demands of Kubeflow as it grows.

I think there are a couple of problems we need to address:

- How do we drive larger engprod projects that would make our infrastructure more scalable?
  - e.g. using GitOps to make it easier for anyone to submit changes to the infra
  - e.g. automating the publishing of release artifacts
- How do we staff a build cop rotation so that developers aren't blocked for long periods when there are operational problems?

I don't think we can address this by simply increasing the pool of "20%" contributors to engprod.

- It's hard to be a good leader in your 20% time
  - We need folks who can identify problems, propose solutions, and drive those solutions to completion
  - This is tough to do if you're only working on a project 20% of the time.
- For security reasons, we can't grant an increasing number of individuals the elevated permissions needed to maintain the test/release infrastructure
- Engprod operational issues are often P0 because they block everyone from getting work done; we can't expect 20% contributors to drop everything in order to respond to P0s.

I can think of two approaches to this:

- We try to staff a horizontal team supporting engprod across Kubeflow
- We make each WG responsible for its own test/release infrastructure

@cliveseldon @yuzisun @ellistarn @neuromage @paveldournov @elikatsis @vpavlin @yanniszark @Jeffwan @krishnadurai @terrytangyuan @gaocegege @andreyvelich @johnugeorge @aronchick @StefanoFioravanzo @elviraux @kimwnasptd @krazyhaas @jinchihe @animeshsingh

issue-label-bot · 2020-07-17T17:01:25Z

Issue-Label Bot is automatically applying the labels:

| Label | Probability |
| --- | --- |
| kind/feature | 0.72 |
| area/engprod | 0.76 |

Please mark this comment with 👍 or 👎 to give our bot feedback!
Links: app homepage, dashboard and code for this bot.

8bitmp3 · 2020-08-12T13:06:40Z

+1. Bumping this discussion @jlewi

swiftdiaries · 2020-08-18T15:10:36Z

I have a suggestion on this. I think there are 3 levels of testing in Kubeflow:

- Native tests (unit and integration tests that run natively on a machine)
- Application tests (KFServing, Jupyter-Web-App, etc.)
- E2E tests (platform-level tests)

The Native tests and Application tests probably need to be nimble and fast. I was thinking we could use GitHub Actions for this. This would free up testing resources and spare application owners from delving deep into the Prow/k8s test infra space (which carries a lot of cognitive load).

The E2E tests can be owned by the platform owners with each of them bringing their own environments and staffing it themselves. So currently, we have:

- GCP
- IBM
- AWS
- vanilla k8s (or on-premises)

@Jeffwan probably has the right idea with #748. Istio has a similar setup, with IBM having a separate testing environment, and so on. This might solve issues around permissions and also free application owners to think about testing their applications independently.

What do people think?

/cc @jlewi @animeshsingh @Jeffwan @8bitmp3

jlewi · 2020-08-18T15:53:12Z

@swiftdiaries How are GitHub Actions administered? Would individual projects be able to administer GitHub Actions for themselves, or would there be a maintenance cost for GitHub org admins?

swiftdiaries · 2020-08-18T15:54:02Z

They can manage it themselves at the repo level, using an OWNERS file within the repo.

yanniszark · 2020-08-20T21:42:17Z

@swiftdiaries I agree that GitHub Actions are a great asset and much easier to figure out and get started with than the current Prow setup. They would be a great option for enabling application developers to quickly develop testing pipelines for their code.
Plus, developers can benefit from a multitude of community-developed actions.

Jeffwan · 2020-08-20T21:52:26Z

I am not sure all application owners would like to set up testing pipelines by themselves. I assume some engineers just want to write test cases and roll them out to a cluster that can quickly pick the tests up. We probably need to cover this case as well.

thesuperzapper · 2020-08-21T02:22:34Z

+1 on GitHub Actions for everything but the end-to-end testing/CI/CD.

jlewi · 2020-08-21T14:25:33Z

One downside of GitHub Actions is that they are pretty closely tied to GitHub. One of the motivations for using Tekton and Kubernetes-native test infra was to make it easy for people to replicate and run the tests on their own infra.

That said, I think the decision about what to prescribe and maintain for their respective projects should be up to the individual WGs.

From that perspective, the question should be: can the WG leads scalably administer GitHub Actions on behalf of their projects? For example, can they onboard new projects/repos without creating toil for the Kubeflow GitHub org admins?

Another issue would be billing and quota. How do we ensure fair scheduling between WGs? If a WG needs to exceed the free tier, are the WG leads in a position to assume those costs?

yanniszark · 2020-08-21T17:38:57Z

> One downside of GitHub Actions is that they are pretty closely tied to GitHub. One of the motivations for using Tekton and Kubernetes-native test infra was to make it easy for people to replicate and run the tests on their own infra.

GitHub has released the runner code, so it should be possible for users to run these pipelines on their own.
Plus, there are open source projects that even let you run them locally:
https://github.com/actions/runner
https://github.com/nektos/act

> Another issue would be billing and quota. How do we ensure fair scheduling between WGs? If a WG needs to exceed the free tier, are the WG leads in a position to assume those costs?

The free tier is unlimited for public repos, so I don't think that's an issue.
The only real restriction I see is the maximum of 20 concurrent jobs (which can be circumvented with self-hosted runners).
Plus, if the tests are lightweight (10 minutes at most), maybe that won't be an issue for some time. It would help if we had data on how often PR tests are triggered.
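
To make the concurrency question concrete, here is a rough back-of-the-envelope sketch using Little's law (average jobs in flight = arrival rate × job duration). The function names, the PR rate, the jobs-per-PR count, and the burst headroom factor are all hypothetical placeholders, not measured Kubeflow data; only the 20-job limit and the 10-minute job length come from the discussion above.

```python
def expected_concurrent_jobs(prs_per_hour: float, jobs_per_pr: int, job_minutes: float) -> float:
    """Average number of jobs running at once (Little's law: rate x duration)."""
    jobs_per_minute = prs_per_hour * jobs_per_pr / 60.0
    return jobs_per_minute * job_minutes


def fits_within_limit(prs_per_hour: float, jobs_per_pr: int, job_minutes: float,
                      limit: int = 20, headroom: float = 2.0) -> bool:
    """True if the average load, scaled by an assumed burst headroom factor,
    stays at or under the concurrency limit."""
    average = expected_concurrent_jobs(prs_per_hour, jobs_per_pr, job_minutes)
    return average * headroom <= limit


if __name__ == "__main__":
    # Hypothetical load: 12 PR pushes per hour, 5 jobs each, 10 minutes per job.
    load = expected_concurrent_jobs(12, 5, 10)
    print(f"average concurrent jobs: {load:.1f}")
    print("fits under 20-job limit:", fits_within_limit(12, 5, 10))
```

This only estimates the average; real PR traffic is bursty, which is why the sketch multiplies by a headroom factor. Plugging in actual trigger rates from the existing Prow jobs would give a better answer.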

@jlewi I think the main problem with using our own Tekton + Kubernetes infra is this:

> We currently lack a sufficient number of individuals (3-5?) with the willingness and ability (e.g. time) to meet the increasing demands of Kubeflow as it grows.

Instead of having to find these individuals, train them in Kubeflow's complicated infra, and make sure they are on call for issues (e.g., quotas filling up), we can use GitHub Actions, which, being a managed service, circumvents this restriction. Plus, we can always use self-hosted runners, which circumvent usage limits, if we bump into scaling issues.

> That said, I think the decision about what to prescribe and maintain for their respective projects should be up to the individual WGs.

Totally agree!

jlewi · 2020-08-26T03:19:45Z

Related buildcop issue: #658

jlewi · 2020-08-26T22:33:10Z

@kubeflow/automl-leads @kubeflow/kfserving-owners @kubeflow/training-leads thoughts?

gaocegege · 2020-08-27T02:18:55Z

cc @kubeflow/wg-automl-leads @kubeflow/wg-training-leads

> I am not sure all application owners would like to set up testing pipelines by themselves. I assume some engineers just want to write test cases and roll them out to a cluster that can quickly pick the tests up. We probably need to cover this case as well.

Same idea as @Jeffwan.

jlewi · 2020-09-01T15:27:23Z

Related issue to move Kubeflow to the Google instance of Prow and stop using the Kubernetes instance:
kubernetes/test-infra#14343

Jeffwan · 2020-09-22T05:18:16Z

I added a doc for community members to review. It provides an alternative option for running e2e tests on AWS.

http://bit.ly/kubeflow-test-infra-aws

stale · 2020-12-24T11:08:09Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in one week if no further activity occurs. Thank you for your contributions.

stale · 2020-12-31T20:01:50Z

This issue has been closed due to inactivity.

issue-label-bot bot added area/engprod kind/feature labels Jul 17, 2020

jlewi mentioned this issue Jul 23, 2020

feat(wg): Add WG Training kubeflow/community#356

Merged

jlewi mentioned this issue Aug 5, 2020

Grant appropriate IAP permissions to access tekton test UI #736

Closed

jlewi mentioned this issue Aug 21, 2020

Insufficient regional quota to satisfy request and katib job is blocked. #749

Closed

andreyvelich mentioned this issue Aug 21, 2020

[Release 1.2] Feature Planning / Roadmap kubeflow/kubeflow#5224

Closed

jlewi mentioned this issue Aug 25, 2020

KFServing CI cluster creation permission issue kserve/kserve#1044

Closed

This was referenced Aug 26, 2020

Add dedicate GCP project for training testing infra kubeflow/community-infra#13

Closed

Add AutoML project kubeflow/community-infra#11

Merged

jlewi added priority/p0 area/community labels Aug 27, 2020

This was referenced Sep 1, 2020

Add WG Notebooks kubeflow/community#407

Merged

How do we surface sync errors kubeflow/community-infra#15

Open

jlewi mentioned this issue Sep 8, 2020

create wg-deployment kubeflow/community#402

Merged

jlewi mentioned this issue Sep 8, 2020

Need additional kf-kcc admins kubeflow/community-infra#14

Open

andreyvelich mentioned this issue Sep 10, 2020

Long term solution for AutoML CI/CD test infrastructure kubeflow/katib#1332

Closed

This was referenced Sep 15, 2020

Add a Feature Store SIG kubeflow/community#408

Merged

GitOps for kubeflow-releasing infrastructure #693

Closed

jlewi mentioned this issue Sep 28, 2020

AWS Init Support for Kuberflow/Testing #755

Merged

Jeffwan mentioned this issue Oct 18, 2020

Remove presubmit job for Kubeflow pytorch-operator kubernetes/test-infra#19617

Merged

This was referenced Oct 26, 2020

AutoML WG needs to take ownership of its release infrastructure, process, and artifacts kubeflow/katib#1367

Closed

Add support for homebrew kubeflow/kfctl#420

Open

jlewi mentioned this issue Nov 4, 2020

Convert Python unittest Argo Workflow to Tekton #769

Closed

jlewi mentioned this issue Nov 20, 2020

Where should notebooks code live kubeflow/kubeflow#5418

Closed

stale bot added the lifecycle/stale label Dec 24, 2020

stale bot closed this as completed Dec 31, 2020

Bobgy mentioned this issue Mar 13, 2021

Add Kubeflow versioning proposal kubeflow/community#498

Closed

jlewi mentioned this issue May 31, 2022

Alternative solution to removal of test on optional-test-infra #1006

Open
