
[feature] documentation for production grade deployment of kubeflow pipelines #6204

Closed
darthsuogles opened this issue Aug 1, 2021 · 17 comments

@darthsuogles

darthsuogles commented Aug 1, 2021

Feature Area

/area documentation
/area samples
/area deployment

What feature would you like to see?

Documentation for production-grade deployment of kubeflow pipelines.

What is the use case or pain point?

Is there a workaround currently?

Unaware


Love this idea? Give it a 👍. We prioritize fulfilling features with the most 👍.

@Bobgy self-assigned this Aug 6, 2021
@Bobgy
Contributor

Bobgy commented Aug 6, 2021

I have some personal notes on the topic, will try to document them.

@darthsuogles
Author

Thank you!
Any chance you had time to work on this in the past couple of weeks?

@vinayan3

@Bobgy in terms of production, some guidance on which components can have > 1 replica would be very useful. Initially, I'm planning to try increasing the replica count to 2 for ml-pipeline-ui. This should allow users to see something even if other things are down.

The other things that I think could have a replica count > 1 are:

  • ml-pipeline
  • metadata-grpc-service
  • ml-pipeline-visualizationserver

Things I'm not sure about are:

  • controller-manager-service

@Bobgy
Contributor

Bobgy commented Aug 20, 2021

Posting my unedited notes first, will try to revisit. Looking forward to any feedback.

Some of these tips are Google Cloud specific, but most of them are general advice.

  • Deploy in a regional cluster, even if your workloads run on zonal node pools. Regional clusters run multiple instances of the K8s API server, so the K8s API is highly available. During scaling, upgrades, and many other maintenance operations, a zonal cluster's K8s API server can become unresponsive.

  • For KFP on GCP, configure a node pool default Google Service Account (GSA) with minimal permissions. You can grant the serviceAccountUser permission on this GSA to users/GSAs to allow access to the proxy.

  • Enable node pool autoscaling so the cluster can absorb spikes in the number of workloads.

  • Set memory/CPU requests and limits on pipeline steps to guarantee they are not evicted when the cluster is under resource pressure. Kubernetes also uses resource requests as the signal for node pool scaling, so when you enable autoscaling, always set resource requests so that Kubernetes can properly decide when to scale up or down.
    Resource requests/limits can be set using the KFP DSL (example pipeline). Reference: https://kubeflow-pipelines.readthedocs.io/en/stable/source/kfp.dsl.html#kfp.dsl.Sidecar.set_memory_limit (see the first sketch after this list).

  • Set memory/CPU requests/limits on system services; the latest KFP release already ships sane default values. However, the memory/CPU needs of the KFP API server (ml-pipeline deployment), the KFP persistence agent (ml-pipeline-persistence-agent deployment), and the Argo workflow controller (workflow-controller deployment) grow roughly linearly with the number of concurrent workflows (even completed ones). Therefore:

  • Reduce the TTL of completed workflows to match your use case. The default is 1 day (see the sketch after this list).

  • Monitor these deployments and set requests/limits based on real usage plus some buffer.

  • Set up retry strategies for steps. There are two failure types: error and failure. Error refers to orchestration-system problems, while failure refers to user-container failures. It is therefore recommended to specify a retryStrategy at least for errors and, depending on your use case, also for failures.
    Example: you can call set_retry(policy="Always")  # or "OnError" (see the sketch after this list).

  • If you need to customize the deployment, pull the KFP manifests as an upstream and follow kustomize's off-the-shelf application workflow. This enables infrastructure as code and easy upgrades.

  • A bonus is to adopt GitOps (there are many tools for this purpose): keep your infrastructure-as-code in a repo and use a GitOps tool to sync it to production. This way you get version control, rollbacks, etc.

  • Use managed storage (Cloud SQL & Cloud Storage) to simplify lifecycle management: https://github.com/kubeflow/pipelines/tree/master/manifests/kustomize/sample.

  • Configure a lifecycle policy (e.g. clean up intermediate artifacts after 7 days) for the object store you are using, e.g. for MinIO or for GCS. Note: in the default MinIO bucket, intermediate artifacts are stored under minio://mlpipeline/artifacts and pipeline templates under minio://mlpipeline/pipelines, so do not set a lifecycle rule on the pipeline templates; they should be kept (see the GCS sketch after this list).
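
As a concrete illustration of the resource requests/limits bullet above, here is a minimal sketch using the KFP v1 DSL. The pipeline name, image, and values are placeholders, not something prescribed by KFP:

```python
from kfp import dsl


@dsl.pipeline(name="resource-requests-example")
def resource_pipeline():
    # Placeholder step; swap in your own component/image.
    train = dsl.ContainerOp(
        name="train",
        image="python:3.9",
        command=["python", "-c", "print('training...')"],
    )
    # Requests give the autoscaler a sizing signal; limits protect against eviction.
    train.set_memory_request("2G").set_memory_limit("4G")
    train.set_cpu_request("1").set_cpu_limit("2")
```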
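
For the workflow TTL point, the TTL of completed workflows can be set per pipeline from the v1 DSL (the 2-hour value below is just an illustration); it can also be configured on the Argo workflow controller side:

```python
from kfp import dsl


@dsl.pipeline(name="ttl-example")
def ttl_pipeline():
    # Placeholder step.
    dsl.ContainerOp(name="echo", image="alpine", command=["echo", "done"])
    # Garbage-collect the underlying Argo workflow 2 hours after it finishes,
    # instead of keeping it for the default of 1 day.
    dsl.get_pipeline_conf().set_ttl_seconds_after_finished(2 * 60 * 60)
```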
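
For the retry point, a sketch of set_retry in the v1 DSL; the failing step is a placeholder:

```python
from kfp import dsl


@dsl.pipeline(name="retry-example")
def retry_pipeline():
    flaky = dsl.ContainerOp(
        name="flaky-step",
        image="alpine",
        command=["sh", "-c", "exit 1"],  # placeholder that always fails
    )
    # Retry up to 3 times. policy="Always" retries on both errors and failures;
    # policy="OnError" restricts retries to orchestration-level errors.
    flaky.set_retry(num_retries=3, policy="Always")
```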
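
For the lifecycle-policy point, a sketch for a GCS artifact bucket using the google-cloud-storage client. The bucket name and prefix are placeholders, and the matches_prefix condition assumes a reasonably recent client version; for MinIO an equivalent rule can be configured with its own tooling:

```python
from google.cloud import storage

client = storage.Client()
# Placeholder name; use the artifact bucket configured for your KFP deployment.
bucket = client.get_bucket("my-kfp-artifacts")

# Delete intermediate artifacts 7 days after creation. The prefix keeps the rule
# away from pipeline templates, which must be retained.
bucket.add_lifecycle_delete_rule(age=7, matches_prefix=["artifacts/"])
bucket.patch()
```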

@Bobgy
Contributor

Bobgy commented Aug 21, 2021

> @Bobgy in terms of production, some guidance on which components can have > 1 replica would be very useful. Initially, I'm planning to try increasing the replica count to 2 for ml-pipeline-ui. This should allow users to see something even if other things are down.
>
> The other things that I think could have a replica count > 1 are:
>
>   • ml-pipeline
>   • metadata-grpc-service
>   • ml-pipeline-visualizationserver
>
> Things I'm not sure about are:
>
>   • controller-manager-service

This is something I haven't experimented with much, but from my understanding:

  • ml-pipeline-ui
  • ml-pipeline*
  • metadata-grpc-service*
  • ml-pipeline-visualizationserver

can be made multi-replica right now.

There is a caveat: ml-pipeline and metadata-grpc-service upgrade the DB schema on start-up, so if you are doing an upgrade, I recommend scaling their replicas down to 1 first.

The controllers should be able to run in leader-election mode: one instance is the leader and one is standby; whenever the leader dies, the standby instance takes over. However, I believe some dependency upgrades might be necessary for the KFP controllers, and we would need to expose flags.
The Argo workflow controller can already be set up this way: https://argoproj.github.io/argo-workflows/high-availability/

@vinayan3

@Bobgy I've applied the suggestions above for the components that can have a replica count greater than one. I've also added PodDisruptionBudgets and Pod Topology Spread Constraints to avoid all the replicas landing on a single node.
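
A minimal sketch of the kind of PodDisruptionBudget mentioned here, via the Kubernetes Python client. It assumes the standard kubeflow namespace and the app: ml-pipeline-ui label from the stock manifests; in practice a kustomize patch would be the more common way to manage this:

```python
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster

pdb = client.V1PodDisruptionBudget(
    metadata=client.V1ObjectMeta(name="ml-pipeline-ui-pdb", namespace="kubeflow"),
    spec=client.V1PodDisruptionBudgetSpec(
        min_available=1,  # keep at least one UI pod up during voluntary disruptions
        selector=client.V1LabelSelector(match_labels={"app": "ml-pipeline-ui"}),
    ),
)
client.PolicyV1Api().create_namespaced_pod_disruption_budget(
    namespace="kubeflow", body=pdb
)
```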

I'll have to look into running the Argo workflow controller in an active/passive mode.

Thanks for the suggestions and advice. It's really appreciated.

@Bobgy
Contributor

Bobgy commented Aug 22, 2021

Cool, interested to see how that plays out.

@rubenaranamorera

@Bobgy Is there any easy way to integrate Kubeflow Pipelines directly with GitOps? Currently we are just converting our pipelines to Argo Workflows. We can run and schedule those pipelines, but we lose all the fancy Kubeflow capabilities in the UI, and it complicates things for data scientists. Any ideas on this?

@Bobgy
Contributor

Bobgy commented Aug 25, 2021

@rubenaranamorera There's a feature request in #6001.

@Bobgy
Contributor

Bobgy commented Aug 25, 2021

Minor update: I added a last point to my comment above about configuring a lifecycle policy for the object store.

@NikeNano
Member

> @Bobgy Is there any easy way to integrate Kubeflow Pipelines directly with GitOps? Currently we are just converting our pipelines to Argo Workflows. We can run and schedule those pipelines, but we lose all the fancy Kubeflow capabilities in the UI, and it complicates things for data scientists. Any ideas on this?

You (@rubenaranamorera) can use the SDK if you like. I did some work on this for GitHub Actions (it has not been updated in quite some time, so it might need some love to work for you): https://github.com/NikeNano/kubeflow-github-action.
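
For reference, a rough sketch of what driving KFP from CI with the SDK can look like (KFP v1 client; the endpoint, file name, and arguments below are placeholders):

```python
import kfp

# Placeholder endpoint; point this at your KFP deployment (or use port-forwarding).
client = kfp.Client(host="https://kfp.example.com/pipeline")

# Submit a pipeline package that was compiled and committed by CI.
client.create_run_from_pipeline_package(
    "pipeline.yaml",
    arguments={"learning_rate": "0.01"},  # placeholder pipeline parameters
    experiment_name="ci-triggered-runs",
)
```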

@stale

stale bot commented Mar 2, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the lifecycle/stale label Mar 2, 2022
zijianjoy removed the lifecycle/stale label Mar 2, 2022
@Bobgy
Contributor

Bobgy commented Mar 2, 2022

/lifecycle freeze

@vinayan3

vinayan3 commented Mar 3, 2022

So after more than 6 months of running the configuration with replicas > 1, there haven't been any issues.

Also, for Argo Workflows the controller may not need to run with more than one replica / be sharded unless there is a huge number of workflows. The pod gracefully restarts on other nodes and is able to pick up work where it left off.

Would there be interest in creating an overlay for HA?

@daro1337

daro1337 commented Mar 1, 2024

@vinayan3 could you please sum up which components could be scaled easily without causing any malfunction in your deployment? Thanks in advance.


This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions bot added the lifecycle/stale label Jun 18, 2024

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.

Labels: kind/feature, lifecycle/stale
Status: Closed
7 participants