Kubeflow on GCP - Support managed storage #4356

Closed
Bobgy opened this issue Aug 12, 2020 · 20 comments · Fixed by GoogleCloudPlatform/kubeflow-distribution#259

@Bobgy
Contributor

Bobgy commented Aug 12, 2020

Google Cloud managed storage (GCS and Cloud SQL) makes it easier for users to manage, back up, and restore KFP data.
It is not currently supported for Kubeflow on GCP; we'd like to gather some user feedback before supporting it.

Please provide your feedback if this is important to you.

@Bobgy Bobgy self-assigned this Aug 12, 2020
@Bobgy Bobgy added the status/triaged, area/deployment/kubeflow, and platform/gcp labels Aug 12, 2020
@Ark-kun
Contributor

Ark-kun commented Aug 12, 2020

Having data in storage outside of the cluster makes it more accessible. We've seen several users having problems trying to get access to this data. The latest one is #4327.

When the data is inside the cluster, MLMD is less useful: it stores the URIs, but they cannot be accessed directly.

@parthmishra
Contributor

I thought Kubeflow on GCP has an option to enable Cloud SQL? What makes GCS/Cloud SQL incompatible with the full Kubeflow 1.0.2 installation? I thought it would be as simple as applying the cloudsqlproxy and minio patches, but I haven't tried.

Regarding importance, I also think managed storage via Cloud SQL and GCS is fairly important for us (and IMO, should probably just be the default on GCP). One thing that managed storage simplifies is lifecycle management, scaling, and high availability of cluster data.
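
For illustration, a minimal sketch of what "applying the cloudsqlproxy and minio patches" might look like as a kustomize overlay. The overlay paths and the pipeline-install-config / mysql-secret names are taken from the GCP configuration shared later in this thread; the params files are placeholders, and this is not a tested Kubeflow 1.0.2 setup:

```yaml
# kustomization.yaml -- hypothetical overlay; paths assume the KFP manifests layout
# and are not a verified Kubeflow 1.0.2 configuration.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- ../../pipeline/upstream/env/gcp/minio-gcs-gateway   # MinIO as a gateway in front of GCS
- ../../pipeline/upstream/env/gcp/cloudsql-proxy      # Cloud SQL proxy for the pipelines DB
configMapGenerator:
- name: pipeline-install-config
  behavior: merge
  envs:
  - params.env            # e.g. GCS bucket name, Cloud SQL connection settings
secretGenerator:
- name: mysql-secret
  behavior: merge
  envs:
  - params-db-secret.env  # Cloud SQL username/password
```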

@Ark-kun
Contributor

Ark-kun commented Aug 12, 2020

I also think managed storage via Cloud SQL and GCS is fairly important for us

What is the relative importance of these features for you? How would you rank the following: "GCS without CloudSQL", "CloudSQL without GCS", "GCS with CloudSQL"?

@parthmishra
Contributor

What is the relative importance of these features for you? How would you rank the following: "GCS without CloudSQL", "CloudSQL without GCS", "GCS with CloudSQL"?

I think I'd rank them:

  1. GCS with CloudSQL
  2. CloudSQL without GCS
  3. GCS without CloudSQL

I think the first choice is obvious: having a stateless cluster significantly reduces operational burden and complexity, in addition to the reasons you brought up. For the second and third choices, it's a tough call, but ultimately the operational burden of supporting an in-cluster DB backed by a persistent disk outweighs the benefit of externalizing artifact storage to GCS. For the most part we do that anyway, but having it abstracted via MinIO is a nice-to-have.

@sm-hawkfish

Another +1 on support for managed storage, which we are using via AI Platform Pipelines.

  • The CloudSQL database has opened up access to MLMD (and other metadata) to other applications (like an in-house model QC dashboard). So this support enables functionality and gives confidence from a business perspective that the audit trail of interactions with our system is safe.

  • Regarding GCS support, I would add to the above comments that it makes debugging easier, since users can fetch artifacts from GCS (e.g. a sample of training data) for local testing. It is also more of a defensive feature, since it makes it more likely that your users will use the system the way you intend. For example: KFP recommends that components read/write from local files and leave it up to the infrastructure to store them. If this backing store is a kubernetes volume, then I may not feel comfortable with this as the primary storage location of the artifacts. As a result, I would end up circumventing the KFP system in that I might manually read/write from GCS and pass only the URIs between components.

@Bobgy
Contributor Author

Bobgy commented Aug 14, 2020

Thanks for the detailed feedback!

Forgot to ask one more question: what do you think about making managed storage the default option on GCP?

@Ark-kun
Contributor

Ark-kun commented Aug 15, 2020

  • For example: KFP recommends that components read/write from local files and leave it up to the infrastructure to store them. If this backing store is a kubernetes volume, then I may not feel comfortable with this as the primary storage location of the artifacts. As a result, I would end up circumventing the KFP system in that I might manually read/write from GCS and pass only the URIs between components.

Interesting. I would have expected the opposite: with artifacts in open GCS instead of an in-cluster PD, users are free to tinker with them, break the immutability guarantees, and have untracked side-channel data access, while with a more black-box store they can only work with the intermediate data via the system channels.
It's an interesting data point.

@marrrcin

I'm surprised to find this topic on GitHub, as I thought the newest version supported managed storage out of the box, especially because there is a KFP link related to managed storage configuration:
https://github.com/kubeflow/pipelines/tree/master/manifests/kustomize/env/gcp

Right now we're facing issues with migrating to another Kubeflow cluster; we wanted to preserve all pipeline runs (experiments), but it seems to be a really laborious task with the current setup (in-cluster PDs and MySQL).

I totally agree with @parthmishra - those are basically my feelings around this topic:

I thought Kubeflow on GCP has an option to enable Cloud SQL? What makes GCS/Cloud SQL incompatible with the full Kubeflow 1.0.2 installation? I thought it would be as simple as applying the cloudsqlproxy and minio patches, but I haven't tried.

Regarding importance, I also think managed storage via Cloud SQL and GCS is fairly important for us (and IMO, should probably just be the default on GCP). One thing that managed storage simplifies is lifecycle management, scaling, and high availability of cluster data.

@Bobgy
I agree that managed storage should be the default on GCP, including Cloud SQL + GCS. This way it will be possible to delete and restore clusters at will, without the pain of losing existing experiments and runs' outputs. GCS can be configured with a Bucket Policy Only policy for a specific service account, which will prevent external/accidental tinkering with the data stored by Kubeflow.
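
To illustrate the kind of scoping @marrrcin mentions, here is a minimal Config Connector sketch that grants only a single service account object access to the KFP artifact bucket. The resource names and the service account are hypothetical, and whether Config Connector is used at all depends on the deployment:

```yaml
# Hypothetical example: grant only the KFP service account object access to the
# artifact bucket (names are placeholders, not from an actual Kubeflow deployment).
apiVersion: iam.cnrm.cloud.google.com/v1beta1
kind: IAMPolicyMember
metadata:
  name: kfp-artifacts-object-admin
spec:
  member: serviceAccount:my-kf-user@my-project.iam.gserviceaccount.com
  role: roles/storage.objectAdmin
  resourceRef:
    apiVersion: storage.cnrm.cloud.google.com/v1beta1
    kind: StorageBucket
    name: my-kfp-artifacts   # bucket with Bucket Policy Only / uniform bucket-level access enabled
```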

@willypicard

I completely agree with @marrrcin and @Bobgy: managed storage, enabled by default on GCP, would allow for easy cluster backups and cluster sizing. Definitely a highly important topic!

@wmikolajczyk

wmikolajczyk commented Aug 18, 2020

I'm facing the same issues (as @marrrcin) with moving my experiments between Kubeflow clusters.
@Bobgy
I would love to have support for managed storage on GCP

@Bobgy
Contributor Author

Bobgy commented Aug 18, 2020

To answer some of the questions:

Yes, KFP standalone (https://github.com/kubeflow/pipelines/tree/master/manifests/kustomize/env/gcp) supports managed storage on GCP.

For full Kubeflow, the 1.1 release timeline was fairly tight; I was only able to get KFP multi-user support in, and didn't manage to tackle managed storage (although it isn't a lot of work: I should be able to reuse the standalone configuration and then write some documentation). I created this issue to get some early feedback, and considering the positive responses, I'll see if I can prioritize making managed storage an option in Kubeflow 1.1.1.

Starting from the next Kubeflow minor release, we can consider changing it to be the default.

@Bobgy
Contributor Author

Bobgy commented Aug 18, 2020

FYI, I created another issue for supporting upgrading Kubeflow 1.0.2 to Kubeflow 1.1.0 while keeping data: #4346. It's independent of managed storage: KFP switched from shared mode to multi-user mode in Kubeflow 1.1.0, so a DB migration will be required to keep your data.

Looks like some of you in this thread will be interested in that topic too, so I'm mentioning it here.

@rmgogogo
Contributor

Hi Yuan, I assume you mean this ticket is for the full Kubeflow installation, not Kubeflow Pipelines (standalone and hosted/CAIP Pipelines).

@Bobgy
Copy link
Contributor Author

Bobgy commented Aug 19, 2020

@rmgogogo Yes, this issue is for the full Kubeflow installation.

@RoyerRamirez

Hi @jlewi, I joined the "Kubeflow Community Pipelines Meeting" today, and was hoping we could add Cloud SQL and GCS support for the new Kubeflow v1.1.* deployment. We are getting ready to deploy in the next few weeks, and it would be awesome if we could get this feature out. We think it would make Kubeflow more stable and easier to maintain in production. Otherwise, we'll go ahead and deploy the standard version.

~ Royer

@jlewi
Contributor

jlewi commented Sep 14, 2020

@RoyerRamirez I'm not working on this issue. I would suggest working with the KFP folks and the folks mentioned in this issue to see about prioritizing it if you think it's important.

@Bobgy
Contributor Author

Bobgy commented Sep 16, 2020

I've got enough feedback and will work on this next month, but there is some uncertainty about how we can release it (as an option or as the default). So please do not wait for this if you are working on something urgent.

@connorlwilkes

connorlwilkes commented Sep 29, 2020

Hi, just a few findings from my end when deploying with Cloud SQL and GCS. First, for context, I am deploying using the following custom 'stack' on GKE 1.16 with Istio 1.5.10:

```yaml
# Pipelines
- ../../pipeline/installs/multi-user
- ../../pipeline/upstream/env/gcp/inverse-proxy
- ../../pipeline/upstream/env/gcp/minio-gcs-gateway
- ../../pipeline/upstream/env/gcp/cloudsql-proxy
- ../../argo/base_v3
configMapGenerator:
  - name: pipeline-install-config
    envs:
      - ./config/params.env
    behavior: merge
  - name: kubeflow-pipelines-profile-controller-env
    envs:
      - ./config/kubeflow-pipelines-profile-controller-env.env
    behavior: merge
  - name: kubeflow-config
    envs:
      - ./config/kubeflow-params.env
secretGenerator:
  - name: mysql-secret
    envs:
      - ./config/params-db-secret.env
    behavior: merge
patchesStrategicMerge:
- ../../pipeline/upstream/env/gcp/gcp-configurations-patch.yaml
```

I am facing one issue: I am getting the following error in the API server: `client_manager.go:353] Failed to check if Minio bucket exists. Error: Access Denied`. I am able to port-forward to the MinIO service normally and I can see that it has all the relevant access. When I jump into a curl container and try to curl the MinIO endpoint, I get an RBAC access denied, and I imagine the same is happening for the API server's requests to MinIO.
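
If the "RBAC: access denied" response comes from the Istio sidecar rather than from MinIO itself, one thing worth checking is whether an Istio AuthorizationPolicy admits traffic from the pipeline API server to MinIO. A rough, unverified sketch; the workload labels and the service-account principal are assumed from the upstream KFP manifests and may differ in this custom stack:

```yaml
# Rough sketch, not a verified fix: allow the ml-pipeline API server to reach MinIO.
# Selector labels and the principal below are assumptions based on the upstream
# KFP manifests and may not match this particular deployment.
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: allow-ml-pipeline-to-minio
  namespace: kubeflow
spec:
  selector:
    matchLabels:
      app: minio
  rules:
  - from:
    - source:
        principals:
        - cluster.local/ns/kubeflow/sa/ml-pipeline
```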

EDIT: My second issue was caused by firewall rules preventing the webhook from running.

@stale

stale bot commented Dec 28, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the lifecycle/stale label Dec 28, 2020
@Bobgy
Contributor Author

Bobgy commented Dec 30, 2020

/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added the lifecycle/frozen label and removed the lifecycle/stale label Dec 30, 2020
google-oss-robot pushed a commit to GoogleCloudPlatform/kubeflow-distribution that referenced this issue May 9, 2021

feat(kfp): use managed storage -- GCS and CloudSQL. Fixes kubeflow/pipelines#4356 (#259)

* feat(kfp): use managed storage -- GCS and CloudSQL
* update
* Revert "update" (reverts commit bf1f212)
* remove isSet
* fix bucket access permission
* rm unused cnrm/pipelines package
* feat(kfp): resource name includes KF_NAME & separate cloudsql-name & bucket-name setters
* reset values
* reset values