Kubeflow on GCP - Support managed storage #4356

Closed
Bobgy opened this issue Aug 12, 2020 · 20 comments · Fixed by GoogleCloudPlatform/kubeflow-distribution#259

@Bobgy
Contributor

Bobgy commented Aug 12, 2020

Google Cloud managed storage (GCS and Cloud SQL) makes it easier for users to manage, back up, and restore KFP data.
It is not currently supported for Kubeflow on GCP; we'd like to gather some user feedback before supporting it.

Please provide your feedback if this is important to you.

@Bobgy Bobgy self-assigned this Aug 12, 2020
@Bobgy Bobgy added the status/triaged, area/deployment/kubeflow, and platform/gcp labels Aug 12, 2020
@Ark-kun
Contributor

Ark-kun commented Aug 12, 2020

Having data in storage outside of the cluster makes it more accessible. We've seen several users having problems trying to get access to this data. The latest one is #4327.

When the data is inside the cluster, MLMD is less useful: it stores the URIs, but they cannot be accessed directly.

@parthmishra
Contributor

I thought Kubeflow on GCP has an option to enable Cloud SQL? What makes GCS/Cloud SQL incompatible with the full Kubeflow 1.0.2 installation? I thought it would be as simple as applying the cloudsqlproxy and minio patches, but I haven't tried.

Regarding importance, I also think managed storage via Cloud SQL and GCS is fairly important for us (and IMO, should probably just be the default on GCP). One thing that managed storage simplifies is lifecycle management, scaling, and high availability of cluster data.
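
For illustration, a minimal sketch of what "applying the cloudsqlproxy and minio patches" might look like as a kustomize overlay. The overlay paths and the pipeline-install-config / mysql-secret names are taken from the GCP configuration shared later in this thread; the params files are placeholders, and this is not a tested Kubeflow 1.0.2 setup:

```yaml
# kustomization.yaml -- hypothetical overlay; paths assume the KFP manifests layout
# and are not a verified Kubeflow 1.0.2 configuration.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- ../../pipeline/upstream/env/gcp/minio-gcs-gateway   # MinIO as a gateway in front of GCS
- ../../pipeline/upstream/env/gcp/cloudsql-proxy      # Cloud SQL proxy for the pipelines DB
configMapGenerator:
- name: pipeline-install-config
  behavior: merge
  envs:
  - params.env            # e.g. GCS bucket name, Cloud SQL connection settings
secretGenerator:
- name: mysql-secret
  behavior: merge
  envs:
  - params-db-secret.env  # Cloud SQL username/password
```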

@Ark-kun
Contributor

Ark-kun commented Aug 12, 2020

I also think managed storage via Cloud SQL and GCS is fairly important for us

What is the relative importance of these features for you? How would you rank the following: "GCS without CloudSQL", "CloudSQL without GCS", "GCS with CloudSQL"?

@parthmishra
Contributor

What is the relative importance of these features for you? How would you rank the following: "GCS without CloudSQL", "CloudSQL without GCS", "GCS with CloudSQL"?

I think I'd rank them:

  1. GCS with CloudSQL
  2. CloudSQL without GCS
  3. GCS without CloudSQL

I think the first choice is obvious: having a stateless cluster significantly reduces operational burden and complexity, in addition to the reasons you brought up. For the second and third choices, it's a tough call, but ultimately the operational burden of supporting an in-cluster DB backed by a persistent disk outweighs the benefit of externalizing artifact storage to GCS. For the most part we do that anyway, but having it abstracted via MinIO is a nice-to-have.

@sm-hawkfish

Another +1 on support for managed storage, which we are using via AI Platform Pipelines.

  • The CloudSQL database has opened up access to MLMD (and other metadata) to other applications (like an in-house model QC dashboard). So this support enables functionality and gives confidence from a business perspective that the audit trail of interactions with our system is safe.

  • Regarding GCS support, I would add to the above comments that it makes debugging easier, since users can fetch artifacts from GCS (e.g. a sample of training data) for local testing. It is also more of a defensive feature, since it makes it more likely that your users will use the system the way you intend. For example: KFP recommends that components read/write from local files and leave it up to the infrastructure to store them. If this backing store is a kubernetes volume, then I may not feel comfortable with this as the primary storage location of the artifacts. As a result, I would end up circumventing the KFP system in that I might manually read/write from GCS and pass only the URIs between components.

@Bobgy
Contributor Author

Bobgy commented Aug 14, 2020

Thanks for the detailed feedback!

Forgot to ask one more question: what do you think about making managed storage the default option on GCP?

@Ark-kun
Contributor

Ark-kun commented Aug 15, 2020

  • For example: KFP recommends that components read/write from local files and leave it up to the infrastructure to store them. If this backing store is a kubernetes volume, then I may not feel comfortable with this as the primary storage location of the artifacts. As a result, I would end up circumventing the KFP system in that I might manually read/write from GCS and pass only the URIs between components.

Interesting. I would have expected the opposite: with artifacts in open GCS instead of an in-cluster PD, users are free to tinker with them, break the immutability guarantees, and have untracked side-channel data access, while with a more black-box store they can only work with the intermediate data via the system channels.
It's an interesting data point.

@marrrcin

I'm surprised to find this topic on GitHub, as I thought the newest version supported managed storage out of the box, especially because there is a KFP link related to managed storage configuration:
https://github.com/kubeflow/pipelines/tree/master/manifests/kustomize/env/gcp

Right now we're facing issues with migrating to another Kubeflow cluster; we wanted to preserve all pipeline runs (experiments), but it seems to be a really laborious task with the current setup (in-cluster PDs and MySQL).

I totally agree with @parthmishra - those are basically my feelings around this topic:

I thought Kubeflow on GCP has an option to enable Cloud SQL? What makes GCS/Cloud SQL incompatible with the full Kubeflow 1.0.2 installation? I thought it would be as simple as applying the cloudsqlproxy and minio patches, but I haven't tried.

Regarding importance, I also think managed storage via Cloud SQL and GCS is fairly important for us (and IMO, should probably just be the default on GCP). One thing that managed storage simplifies is lifecycle management, scaling, and high availability of cluster data.

@Bobgy
I agree that managed storage should be the default on GCP, including Cloud SQL + GCS. This way it will be possible to delete and restore clusters at will, without the pain of losing existing experiments and runs' outputs. GCS can be configured with a Bucket Policy Only policy for a specific service account, which will prevent external/accidental tinkering with the data stored by Kubeflow.
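
To illustrate the kind of scoping @marrrcin mentions, here is a minimal Config Connector sketch that grants only a single service account object access to the KFP artifact bucket. The resource names and the service account are hypothetical, and whether Config Connector is used at all depends on the deployment:

```yaml
# Hypothetical example: grant only the KFP service account object access to the
# artifact bucket (names are placeholders, not from an actual Kubeflow deployment).
apiVersion: iam.cnrm.cloud.google.com/v1beta1
kind: IAMPolicyMember
metadata:
  name: kfp-artifacts-object-admin
spec:
  member: serviceAccount:my-kf-user@my-project.iam.gserviceaccount.com
  role: roles/storage.objectAdmin
  resourceRef:
    apiVersion: storage.cnrm.cloud.google.com/v1beta1
    kind: StorageBucket
    name: my-kfp-artifacts   # bucket with Bucket Policy Only / uniform bucket-level access enabled
```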

@willypicard

I completely agree with @marrrcin and @Bobgy: managed storage, enabled by default on GCP, would allow for easy cluster backups and cluster sizing. Definitely a highly important topic!

@wmikolajczyk

wmikolajczyk commented Aug 18, 2020

I'm facing the same issues (as @marrrcin) with moving my experiments between Kubeflow clusters.
@Bobgy
I would love to have support for managed storage on GCP

@Bobgy
Contributor Author

Bobgy commented Aug 18, 2020

To answer some of the questions:

Yes, KFP standalone (https://github.com/kubeflow/pipelines/tree/master/manifests/kustomize/env/gcp) supports managed storage on GCP.

For full Kubeflow, the 1.1 release timeline was fairly tight; I was only able to get KFP multi-user support in, and didn't manage to tackle managed storage (although it isn't a lot of work: I should be able to reuse the standalone configuration and then write some documentation). I created this issue to get some early feedback, and considering the positive responses, I'll see if I can prioritize making managed storage an option in Kubeflow 1.1.1.

Starting from the next Kubeflow minor release, we can consider changing it to be the default.

@Bobgy
Contributor Author

Bobgy commented Aug 18, 2020

FYI, I created another issue for supporting upgrading Kubeflow 1.0.2 to Kubeflow 1.1.0 while keeping data: #4346. It's independent of managed storage: KFP switched from shared mode to multi-user mode in Kubeflow 1.1.0, so a DB migration will be required to keep your data.

Looks like some of you in this thread will be interested in that topic too, so I'm mentioning it here.

@rmgogogo
Contributor

Hi Yuan, I assume you mean this ticket is for the full Kubeflow installation, not Kubeflow Pipelines (standalone and hosted/CAIP Pipelines).

@Bobgy
Copy link
Contributor Author

Bobgy commented Aug 19, 2020

@rmgogogo Yes, this issue is for the full Kubeflow installation.

@RoyerRamirez

Hi @jlewi, I joined the "Kubeflow Community Pipelines Meeting" today, and was hoping we could add Cloud SQL and GCS support for the new Kubeflow v1.1.* deployment. We are getting ready to deploy in the next few weeks, and it would be awesome if we could get this feature out. We think it would make Kubeflow more stable and easier to maintain in production. Otherwise, we'll go ahead and deploy the standard version.

~ Royer

@jlewi
Contributor

jlewi commented Sep 14, 2020

@RoyerRamirez I'm not working on this issue. I would suggest working with the KFP folks and the folks mentioned in this issue to see about prioritizing it if you think it's important.

@Bobgy
Contributor Author

Bobgy commented Sep 16, 2020

I've got enough feedback and will work on this next month, but there is some uncertainty about how we can release it (as an option or as the default). So please do not wait for this if you are working on something urgent.

@connorlwilkes

connorlwilkes commented Sep 29, 2020

Hi, just a few findings from my end when deploying with Cloud SQL and GCS. First, for context, I am deploying using the following custom 'stack' on GKE 1.16 with Istio 1.5.10:

```yaml
# Pipelines
- ../../pipeline/installs/multi-user
- ../../pipeline/upstream/env/gcp/inverse-proxy
- ../../pipeline/upstream/env/gcp/minio-gcs-gateway
- ../../pipeline/upstream/env/gcp/cloudsql-proxy
- ../../argo/base_v3
configMapGenerator:
  - name: pipeline-install-config
    envs:
      - ./config/params.env
    behavior: merge
  - name: kubeflow-pipelines-profile-controller-env
    envs:
      - ./config/kubeflow-pipelines-profile-controller-env.env
    behavior: merge
  - name: kubeflow-config
    envs:
      - ./config/kubeflow-params.env
secretGenerator:
  - name: mysql-secret
    envs:
      - ./config/params-db-secret.env
    behavior: merge
patchesStrategicMerge:
- ../../pipeline/upstream/env/gcp/gcp-configurations-patch.yaml
```

I am facing one issue: I am getting the following error in the API server: `client_manager.go:353] Failed to check if Minio bucket exists. Error: Access Denied`. I am able to port-forward to the MinIO service normally and I can see that it has all the relevant access. When I jump into a curl container and try to curl the MinIO endpoint, I get an RBAC access denied, and I imagine the same is happening for the API server's requests to MinIO.
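
If the "RBAC: access denied" response comes from the Istio sidecar rather than from MinIO itself, one thing worth checking is whether an Istio AuthorizationPolicy admits traffic from the pipeline API server to MinIO. A rough, unverified sketch; the workload labels and the service-account principal are assumed from the upstream KFP manifests and may differ in this custom stack:

```yaml
# Rough sketch, not a verified fix: allow the ml-pipeline API server to reach MinIO.
# Selector labels and the principal below are assumptions based on the upstream
# KFP manifests and may not match this particular deployment.
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: allow-ml-pipeline-to-minio
  namespace: kubeflow
spec:
  selector:
    matchLabels:
      app: minio
  rules:
  - from:
    - source:
        principals:
        - cluster.local/ns/kubeflow/sa/ml-pipeline
```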

EDIT: My second issue was caused by firewall rules preventing the webhook from running.

@stale

stale bot commented Dec 28, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the lifecycle/stale label Dec 28, 2020
@Bobgy
Contributor Author

Bobgy commented Dec 30, 2020

/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added the lifecycle/frozen label and removed the lifecycle/stale label Dec 30, 2020
google-oss-robot pushed a commit to GoogleCloudPlatform/kubeflow-distribution that referenced this issue May 9, 2021

feat(kfp): use managed storage -- GCS and CloudSQL. Fixes kubeflow/pipelines#4356 (#259)

* feat(kfp): use managed storage -- GCS and CloudSQL
* update
* Revert "update" (reverts commit bf1f212)
* remove isSet
* fix bucket access permission
* rm unused cnrm/pipelines package
* feat(kfp): resource name includes KF_NAME & separate cloudsql-name & bucket-name setters
* reset values
* reset values