Optional Test Infra Deprecation Notice #993

Closed
PatrickXYS opened this issue Mar 5, 2022 · 11 comments · Fixed by #1003

Comments

@PatrickXYS
Member

PatrickXYS commented Mar 5, 2022

Hi Kubeflow Community,

I have been maintaining and providing optional-test-infra for the Kubeflow community's presubmit testing.

Due to my recent career change, and because there are not sufficient resources to keep optional-test-infra running, optional-test-infra may stop working around June 2022. That said, Kubeflow community WGs have a 2-month buffer to find alternatives to replace optional-test-infra as their presubmit testing solution.

Through the journey, I appreciate all the efforts the community has been putting into optional-test-infra, especially @Jeffwan, @andreyvelich, @Bobgy, @theofpa, and anyone who contributed to it!

@kubeflow/wg-automl-leads
@kubeflow/wg-manifests-leads
@kubeflow/wg-notebooks-leads
@kubeflow/wg-training-leads
@yuzisun

@terrytangyuan
Member

That’s unfortunate to hear but best of luck on your new journey.

@kubeflow/wg-training-leads @kubeflow/wg-automl-leads Should we consider switching to GitHub Actions with K3d or minikube since our tests do not require special hardware?

@kubeflow/project-steering-group Any plans of additional resources from Google on this?

@surajkota

surajkota commented Mar 17, 2022

Hi all,

Instead of deprecating the infrastructure, can we decouple the funding of the account from the design and implementation of the testing infrastructure running in this account? Creating a new infra might be a big effort.

Funding of account:
I am from AWS and want to clarify that the AWS program for funding the account has not stopped. We did look at the account earlier this year and there were enough credits at that time for the CI to run. Please refer to this comment for the process of renewing the credits: kubeflow/manifests#2099 (comment).

Design and maintenance of testing infra
Thank you @PatrickXYS for owning the initial infrastructure design and implementation. In the long term, we need more than one person to maintain it. IMO the questions to ask are: are there folks interested in owning and driving the efforts required to maintain this infrastructure? What is the effort required to maintain it? What docs/tutorials do we need for new contributors?

@thesuperzapper
Member

@surajkota @terrytangyuan @PatrickXYS I really like the idea of using GitHub actions for most parts of Kubeflow!

This will make our tests portable (not tightly integrated with a specific sponsor's infrastructure).

Considerations:

  • github-hosted runners:
    • have no time-limits for open-source projects (the 2000 minute limit is only for private repos)
    • only have 2-CPU, 7GB-RAM, 14GB-SSD
  • self-hosted runners:
    • can be provided by a sponsor like AWS, with access to larger resources, and GPUs
  • action runners are quite different from the current optional-testing-infra approach:
    • the workflows run on isolated VMs (not Kubernetes clusters), however, we can dynamically spawn kind / k3d clusters in those VMs (if the test requires it)
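
As a rough illustration of the considerations above, here is a minimal Python sketch of how a WG might decide whether a given test job fits on a github-hosted runner or needs a self-hosted one. The names `Resources`, `GITHUB_HOSTED`, and `pick_runner` are hypothetical, and the limits are simply the figures quoted in the list above (they are assumptions, not values checked against current GitHub documentation).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    """Resources a test job needs, or a runner offers (hypothetical helper type)."""
    cpus: int
    ram_gb: float
    disk_gb: float


# Assumption: github-hosted runner size as quoted in this thread
# (2 CPU, 7 GB RAM, 14 GB SSD).
GITHUB_HOSTED = Resources(cpus=2, ram_gb=7, disk_gb=14)


def pick_runner(needs: Resources, needs_gpu: bool = False) -> str:
    """Return "github-hosted" if the job fits within the quoted limits,
    otherwise "self-hosted" (e.g. sponsor-provided runners)."""
    if needs_gpu:
        # github-hosted runners are assumed to have no GPUs here.
        return "self-hosted"
    fits = (
        needs.cpus <= GITHUB_HOSTED.cpus
        and needs.ram_gb <= GITHUB_HOSTED.ram_gb
        and needs.disk_gb <= GITHUB_HOSTED.disk_gb
    )
    return "github-hosted" if fits else "self-hosted"


if __name__ == "__main__":
    # A small kind / k3d based presubmit test vs. a heavier job.
    print(pick_runner(Resources(cpus=2, ram_gb=4, disk_gb=10)))
    print(pick_runner(Resources(cpus=8, ram_gb=32, disk_gb=100)))
```

A check like this is only a first filter; whether a kind / k3d cluster plus the components under test actually fit in that footprint would still need to be measured per repo.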

The next question is how can an "infra sponsor" (like AWS) provide the self-hosted runners to the Kubeflow org:

  • One possible option is using something like philips-labs/terraform-aws-github-runner to dynamically scale AWS Spot instances for the runners
  • Alternatively, there are many other projects which help people manage their runners, see awesome-runners
  • Finally, if necessary to meet our security requirements, we could create a new project that focuses on securely managing self-hosted runners for "public" projects (we should probably collaborate with GitHub directly on this, as a good solution is needed for other open-source projects)

It's important to highlight that self-hosted runners have security considerations for the infra sponsor, but these can be mitigated if the self-hosted runners are set up in an isolated/ephemeral way (and if workflows require "approval" to run from untrusted people).

@surajkota

surajkota commented Mar 19, 2022

Are you suggesting each working group own their own infra and setup?

If not, we are again missing the first step, i.e. finding new maintainers for testing infra.

On a side note, I see no major issues on this repository, which seems to suggest it is working pretty well. So before we jump to a redesign, let's find out if we can have new maintainers, probably a wg-testing?

Regarding portability, isn't the current infra using prow? If yes, I think it is very much portable as well.

@thesuperzapper
Member

Are you suggesting each working group own their own infra and setup?

If not, we are again missing the first step, i.e. finding new maintainers for testing infra.

@surajkota You are correct, I was still suggesting that someone (like AWS) provide the self-hosted GitHub Actions runners, which the rest of the WGs can then use on their kubeflow-org GitHub repos.

On a side note, I see no major issues on this repository, which seems to suggest it is working pretty well. So before we jump to a redesign, let's find out if we can have new maintainers, probably a wg-testing?

I agree that it's not "broken", but GitHub actions is a very nice developer experience, so I think it's worth a look.
But I agree, if we can find new maintainers for the shared infrastructure, this will give us more time to consider our options. I assume at least a few of these new maintainers will have to work at AWS (because AWS is currently providing the physical infrastructure).

To ensure we don't end up with a single-person risk again, we could form a wg-testing, with a mandate similar to that of the Kubernetes sig-testing.

Regarding portability, isn't the current infra using prow? If yes, I think it is very much portable as well.

Yes, we are using prow in most repos (but some WGs are already migrating some things to GitHub actions).

@surajkota

@PatrickXYS can you clarify what you mean by optional-test-infra may stop working around June 2022? Specifically what will stop working and how can this be stopped?

@PatrickXYS
Member Author

As I said in the issue description, there are two aspects of NOT-WORKING:

  1. The account doesn't have enough resources/credits to continue working
  2. I don't have enough bandwidth to maintain the infra

The Kubeflow community / AWS could invest more credits in the existing test-infra, but I may not be able to continue maintaining it, so I'd prefer not to go with this option.

The option I prefer is that the community avoid establishing a horizontal team to maintain a centralized test-infra. Instead, allowing WGs to choose their own solutions should be more scalable and maintainable.

Created sub-issues in all repos which consume optional-test-infra as of now.

@kimwnasptd
Member

@PatrickXYS thank you very much for all your efforts on this infra, it has served us really well throughout these years! This is also evident from the low number of open issues, despite the fact that it is being heavily used by at least kubeflow/{kubeflow,katib,training-operator}.

I really agree with @surajkota's approach on this situation. Let's try to understand first what are the commitment requirements for maintaining this infra, as well as which parts of the infra will need periodic care/maintenance. This way we can all better evaluate if it makes sense for us as a community to stick with this infra or start investing time in other solutions.

@PatrickXYS it's completely understandable that you don't have cycles on this anymore, and if there's no commitment from the rest of the community on helping maintain it then indeed let's deprecate it. But, again, please help us understand the maintenance burden first.

More specifically these are the first questions that come to mind:

  1. How can someone new get access to the AWS infra, for example see things via the AWS console?
  2. Do we have some relevant documentation on the moving parts of this infra? For example:
    1. What is the webhook flow, triggered by GitHub?
    2. What's the entrypoint code that is run for each Prow Job, which are triggered by PRs?
  3. Which of these moving parts need periodic care? For example:
    1. What's the effort for maintaining the Prow cluster?
    2. How can we update the entrypoint code/image that Prow runs for every PR?

These are just some initial questions that come to mind, but I think can get us far enough for now.

@PatrickXYS
Member Author

Sorry for the late response given the limited bandwidth on my side.

A few things to bring up here:

The timeline of deprecating optional-test-infra:

  1. By May 23rd, I'll start filing PRs to remove presubmit requests to optional-test-infra for existing kubeflow repos.
  2. By June 6th, I'll remove resources from AWS account and send a deprecation report to the community.

https://github.com/orgs/kubeflow/teams/wg-automl-leads
https://github.com/orgs/kubeflow/teams/wg-manifests-leads
https://github.com/orgs/kubeflow/teams/wg-notebooks-leads
https://github.com/orgs/kubeflow/teams/wg-training-leads
@yuzisun

Kubeflow WG folks, let's start finding suitable alternatives and migrate to them in line with this timeline.


To answer the question from @kimwnasptd :

  1. How can someone new get access to the AWS infra, for example, see things via the AWS console?
    A: This is the hardest problem to solve here, given that the current AWS account is a personal account. Unless people can obtain the trust of the account owner, AWS, and the Kubeflow community, I don't think it's possible.

  2. Do we have some relevant documentation on the moving parts of this infra? For example:
    1. What is the webhook flow, triggered by GitHub?
      A: Yes, there are some webhooks configured in some kubeflow repos; these were set up by previous Google folks. There's no public documentation, given there's no well-defined privacy rule set up.
    2. What's the entrypoint code that is run for each Prow Job, which are triggered by PRs?
      A: Answers to those technical questions could vary person by person. I'd recommend reading https://github.com/kubeflow/testing/tree/master/aws.

  3. Which of these moving parts need periodic care? For example:
    1. What's the effort for maintaining the Prow cluster?
      A: Resources, occasional failure checks, and monitoring.
    2. How can we update the entrypoint code/image that Prow runs for every PR?
      A: Answers to those technical questions could vary person by person. I'd recommend reading https://github.com/kubeflow/testing/tree/master/aws.

I think the main point here is that the account is a personal account, we don't have well-defined privacy rules set up, and it's difficult to transfer ownership to other community folks. Finding alternatives might be a much easier thing to do.

@yuzisun
Member

yuzisun commented May 1, 2022

@PatrickXYS I request that we hold off on the deprecation, as we have not reached a decision on the migration plan yet. Also, as previously discussed with @surajkota, AWS is willing to keep sponsoring the account; I am not sure if anything has changed.

@PatrickXYS
Member Author

@yuzisun and other Kubeflow WG folks, I posted the deprecation notice on March 4th, trying to provide as much buffer time as possible to the community, so that Kubeflow WGs could find their preferred alternatives for presubmit E2E testing. Also, I tagged all the WGs and created sub-issues in the corresponding repositories.

I'm not sure what the main reason is that has kept the community from finding other options for two months, or what the current progress is.

The AWS account is running out of credits. If we don't deprecate by the end of this month, it will charge my personal banking account (set up as backup) for thousands of dollars per month.

Please take any action to migrate to preferred alternatives ASAP.


6 participants