[feature] documentation for production grade deployment of kubeflow pipelines #6204
Comments
I have some personal notes on the topic, will try to document them.
Thank you!
@Bobgy in terms of production, some guidance on which components can have > 1 replica would be very useful. Initially, I'm planning to try increasing the replica count to 2. The other components that I think could have a replica count > 1 are:
Things I'm not sure about are:
Posting my unedited notes first; I'll try to revisit them. Looking forward to any feedback. Some of these tips are Google Cloud specific, but most of them are general advice.
This is something I haven't experimented with much, but from my understanding:
can be made multi-replica right now. There is a caveat: ml-pipeline and metadata-grpc-service upgrade the DB schema on start-up, so if you are doing an upgrade, I recommend changing the replica count to 1 first. The controllers should be able to run in leader-election mode: one instance is the leader, one instance is standby, and whenever the leader dies, the standby instance takes over. However, I believe for the KFP controllers some dependency upgrades might be necessary and we would need to expose flags.
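To make the upgrade caveat concrete, here is a minimal sketch (deployment names, namespace, and replica counts are assumptions about a typical KFP install) of temporarily scaling the schema-migrating deployments down to one replica around an upgrade, using the official Kubernetes Python client:

```python
# Hedged sketch: scale ml-pipeline and the metadata gRPC deployment to a
# single replica before a KFP upgrade (they migrate the DB schema on
# start-up), then scale back out afterwards. Names and namespace are assumed.
from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() inside a pod
apps = client.AppsV1Api()

def set_replicas(name: str, namespace: str, replicas: int) -> None:
    """Patch the Deployment's scale subresource to the desired replica count."""
    apps.patch_namespaced_deployment_scale(
        name=name,
        namespace=namespace,
        body={"spec": {"replicas": replicas}},
    )

# Scale down to a single replica before applying the upgrade.
for deployment in ("ml-pipeline", "metadata-grpc-deployment"):  # assumed names
    set_replicas(deployment, "kubeflow", 1)

# ... apply the KFP upgrade manifests here ...

# Scale back out once the new version is healthy.
for deployment in ("ml-pipeline", "metadata-grpc-deployment"):
    set_replicas(deployment, "kubeflow", 2)
```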
@Bobgy I've taken the suggestions above for the components that can have a replica count greater than one. I've also added PodDisruptionBudgets and Pod Topology Spread Constraints to avoid all the replicas landing on a single node (see the sketch below). I'll have to look into getting the Argo workflow controller to run in an active/passive mode. Thanks for the suggestions and advice, it's really appreciated.
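For reference, a minimal sketch of the PodDisruptionBudget part (the namespace and the app: ml-pipeline label selector are assumptions about how the deployment is labelled), again via the Kubernetes Python client; a topology spread constraint would be added to the Deployment's pod template with a similar patch:

```python
# Hedged sketch: create a PodDisruptionBudget so voluntary disruptions
# (node drains, cluster upgrades) always leave at least one replica running.
from kubernetes import client, config

config.load_kube_config()
policy = client.PolicyV1Api()

pdb = client.V1PodDisruptionBudget(
    api_version="policy/v1",
    kind="PodDisruptionBudget",
    metadata=client.V1ObjectMeta(name="ml-pipeline-pdb", namespace="kubeflow"),
    spec=client.V1PodDisruptionBudgetSpec(
        min_available=1,
        selector=client.V1LabelSelector(match_labels={"app": "ml-pipeline"}),  # assumed label
    ),
)
policy.create_namespaced_pod_disruption_budget(namespace="kubeflow", body=pdb)
```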
Cool, interested to see how that plays out.
@Bobgy Is there any easy way to integrate Kubeflow Pipelines directly with GitOps? Currently we are just converting our pipelines to Argo Workflows. We can run and schedule those pipelines, but we lose all the fancy Kubeflow capabilities from the UI, and it complicates things for data scientists. Any ideas on this?
@rubenaranamorera There's a feature request in #6001. |
Minor update: I added a last point to my comment above about configuring a lifecycle policy for the object store.
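For the Google Cloud case, a minimal sketch (the bucket name and retention window are placeholders) of what such a lifecycle policy could look like with the google-cloud-storage client:

```python
# Hedged sketch: add an object lifecycle rule to the KFP artifact bucket so
# old artifacts are cleaned up automatically. Bucket name and retention
# period are placeholders; pick values that match your retention needs.
from google.cloud import storage

storage_client = storage.Client()
bucket = storage_client.get_bucket("my-kfp-artifact-bucket")  # placeholder name

bucket.add_lifecycle_delete_rule(age=90)  # delete objects older than 90 days
bucket.patch()
```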
You (@rubenaranamorera) can use the SDK if you like to; I did some work on this for GitHub Actions (it has not been updated in quite some time, so it might need some love to work for you): https://github.com/NikeNano/kubeflow-github-action.
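To give an idea of the SDK route mentioned above, here is a minimal sketch (the KFP endpoint, pipeline module, and experiment name are hypothetical placeholders) of compiling and submitting a pipeline from a CI job with the kfp client, rather than hand-converting pipelines to Argo Workflows:

```python
# Hedged sketch: compile a pipeline and push it to a KFP instance from CI.
# The endpoint, module, pipeline function, and experiment name are
# hypothetical placeholders.
import kfp
from kfp import compiler

from my_pipelines import train_pipeline  # hypothetical pipeline function

# Compile the Python pipeline definition into a package KFP can run.
compiler.Compiler().compile(train_pipeline, "train_pipeline.yaml")

client = kfp.Client(host="https://kfp.example.com/pipeline")  # placeholder endpoint

# Register the pipeline and kick off a run for this commit.
client.upload_pipeline("train_pipeline.yaml", pipeline_name="train-pipeline")
client.create_run_from_pipeline_package(
    "train_pipeline.yaml",
    arguments={},
    experiment_name="ci-runs",
)
```

On subsequent commits you would typically upload a new version of an existing pipeline rather than a brand-new pipeline, but the overall flow from CI is the same.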
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
/lifecycle freeze
So after more than 6 months of running the configuration with replicas > 1, there haven't been any issues. Also, the Argo workflow controller may not need to be run with more than one replica or sharded unless there is a huge number of workflows; the pod gracefully restarts on another node and is able to pick up work where it left off. Would there be interest in creating an overlay for HA?
@vinayan3 could you please sum up which components can be easily scaled without causing any malfunction in your deployment? Thanks in advance.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it. |
Feature Area
/area documentation
/area samples
/area deployment
What feature would you like to see?
Documentation for production-grade deployment of Kubeflow Pipelines.
What is the use case or pain point?
Is there a workaround currently?
Unaware
Love this idea? Give it a 👍. We prioritize fulfilling features with the most 👍.