feat(frontend): caching may now be disabled when starting pipeline runs #8177

Closed · wants to merge 1 commit
Conversation

TobiasGoerke
Contributor

As discussed in #8104 and the subsequent community meeting on August 17, Kubeflow requires a way to manage its artifacts and cache.

We've agreed to proceed and implement suitable mechanisms step by step.

This first feature adds a switch that lets users disable or enable caching for new pipeline runs.

(Screenshot: Kubeflow Central Dashboard, 2022-08-22)

Implementation notes:

  • I've included a mechanism to restore previous cache staleness values: when cloning runs and toggling caching, the initial values can always be restored.
  • The API/SDK doesn't support disabling caching when running referenced pipelines (e.g. see the Python SDK). Thus, the workflow YAML needs to be fetched with an additional request when disabling the cache in this case (see the sketch below).
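
A rough, illustrative sketch (not the PR's actual frontend or backend code) of what disabling the cache on a fetched v1 workflow manifest could look like, assuming the cache webhook's object selector keys on the pipelines.kubeflow.org/cache_enabled pod label:

package main

import (
    "fmt"

    "sigs.k8s.io/yaml"
)

// disableCache labels every pod template in an Argo workflow manifest so the
// cache mutating webhook's object selector no longer matches the resulting pods.
func disableCache(workflowYAML []byte) ([]byte, error) {
    var wf map[string]interface{}
    if err := yaml.Unmarshal(workflowYAML, &wf); err != nil {
        return nil, err
    }
    spec, _ := wf["spec"].(map[string]interface{})
    templates, _ := spec["templates"].([]interface{})
    for _, t := range templates {
        tmpl, ok := t.(map[string]interface{})
        if !ok {
            continue
        }
        meta, _ := tmpl["metadata"].(map[string]interface{})
        if meta == nil {
            meta = map[string]interface{}{}
            tmpl["metadata"] = meta
        }
        labels, _ := meta["labels"].(map[string]interface{})
        if labels == nil {
            labels = map[string]interface{}{}
            meta["labels"] = labels
        }
        labels["pipelines.kubeflow.org/cache_enabled"] = "false"
    }
    return yaml.Marshal(wf)
}

func main() {
    manifest := []byte(`
apiVersion: argoproj.io/v1alpha1
kind: Workflow
spec:
  templates:
  - name: echo
    container:
      image: alpine
`)
    out, err := disableCache(manifest)
    if err != nil {
        panic(err)
    }
    fmt.Println(string(out))
}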

@google-cla

google-cla bot commented Aug 22, 2022

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up-to-date status, view the checks section at the bottom of the pull request.

@google-oss-prow

Hi @TobiasGoerke. Thanks for your PR.

I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@TobiasGoerke
Contributor Author

Does this correspond to your ideas, @chensun and @zijianjoy?

@juliusvonkohout
Member

juliusvonkohout commented Aug 23, 2022

I discussed with @TobiasGoerke that we will go with a disable-cache switch in the pipeline run UI and a slightly modified version of #7939 (comment). If the maximum cache staleness leads to an empty list from

var executionCaches []*model.ExecutionCache

then we should scan the database and delete all entries older than max_cache_staleness. This keeps lookups performant in very large installations and solves another long-standing issue: the indefinitely growing cachedb.
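
A rough sketch of that pruning step (hypothetical table and column names, not the actual cachedb schema or the cache-server's code):

package cache

import (
    "database/sql"
    "time"
)

// pruneExpiredCacheEntries deletes cachedb rows older than the maximum cache
// staleness so the table cannot grow indefinitely. It would be invoked whenever
// a staleness-bounded lookup comes back empty.
func pruneExpiredCacheEntries(db *sql.DB, maxCacheStaleness time.Duration) (int64, error) {
    cutoff := time.Now().Add(-maxCacheStaleness).Unix()
    res, err := db.Exec(
        "DELETE FROM execution_caches WHERE started_at_in_sec < ?",
        cutoff,
    )
    if err != nil {
        return 0, err
    }
    return res.RowsAffected()
}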

So: TWO environment variables for the cache-server, one providing a default (used if the pipeline does not set a value) and one a maximum (larger values in the pipeline are ignored) cache staleness. An administrator can then set the expiration date on their MinIO/S3/GCS storage backend to the same value as the maximum cache staleness and provide a sensible staleness default for their users' pipelines. We limit the user-set value in the pipeline definition to the maximum set by the administrator. Users also no longer need to recompile existing pipelines, because they can disable the cache from the UI. I think setting the exact cache duration from the UI is overkill; a disable/enable switch is enough.
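
A minimal sketch of that default/maximum rule, using Go durations for brevity (the real values are ISO-8601 strings such as P0DT0H3M30S and would need an ISO-8601 parser; names here are illustrative, not the actual cache-server code):

package cache

import "time"

// effectiveStaleness returns the staleness the cache server should honour: the
// pipeline's own value if one is set, the administrator's default otherwise,
// and never more than the administrator's maximum.
func effectiveStaleness(pipelineValue *time.Duration, defaultStaleness, maxStaleness time.Duration) time.Duration {
    staleness := defaultStaleness
    if pipelineValue != nil {
        staleness = *pipelineValue
    }
    if staleness > maxStaleness {
        staleness = maxStaleness
    }
    return staleness
}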

I think this covers most use cases, is independent of the storage backend (compared to #7938), and is rather easy to implement.

This is also very much in line with what Google wants, according to the comments and the KFP community meeting.

@juliusvonkohout
Member

/ok-to-test

@TobiasGoerke
Contributor Author

TobiasGoerke commented Aug 31, 2022

As discussed with @zijianjoy (see the Google Docs comments), disabling the cache for new runs/jobs is now implemented in the backend. The UI hasn't changed and still features a switch for controlling the cache.

Frontend and backend changes are implemented for both V1 and V2.

Feel free to review and test them.

Note: There's a new field disable_cache in the Run and Job protos. I've decided against declaring the new field in CreateRunRequest or CreateJobRequest, as that would break backwards compatibility (the request body is currently flattened, and the run/job is sent as the top-level object instead of the request object itself).

TODO: review v2_template.go. I'm unsure where or how to set the cachingOptions there. Also, when EnableCache is set to false, the field is removed from the serialized spec entirely (proto3 omits scalar zero values), so the default value true is inferred again downstream.
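
For reference, a rough sketch of setting cachingOptions on every root-level task, assuming the v2alpha1 pipelinespec Go bindings expose Root → Dag → Tasks with per-task CachingOptions (treat these names as assumptions); where to hook this into v2_template.go is exactly the open question above:

package template

import "github.com/kubeflow/pipelines/api/v2alpha1/go/pipelinespec"

// disableCaching sets EnableCache to false on every task in the root DAG. Note
// that proto3 drops scalar zero values on serialization, so an explicit
// enableCache: false will not survive a round trip unless defaults are emitted
// explicitly when marshalling.
func disableCaching(spec *pipelinespec.PipelineSpec) {
    dag := spec.GetRoot().GetDag()
    if dag == nil {
        return
    }
    for _, task := range dag.GetTasks() {
        task.CachingOptions = &pipelinespec.PipelineTaskSpec_CachingOptions{
            EnableCache: false,
        }
    }
}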

@TobiasGoerke
Contributor Author

/retest

@TobiasGoerke TobiasGoerke marked this pull request as ready for review August 31, 2022 16:42
@google-oss-prow google-oss-prow bot requested a review from Linchin August 31, 2022 16:42
@TobiasGoerke TobiasGoerke changed the title [WIP] feat(frontend): caching may now be disabled when starting pipeline runs feat(frontend): caching may now be disabled when starting pipeline runs Aug 31, 2022
@TobiasGoerke TobiasGoerke changed the title feat(frontend): caching may now be disabled when starting pipeline runs [WIP] feat(frontend): caching may now be disabled when starting pipeline runs Aug 31, 2022
@juliusvonkohout
Member

@TobiasGoerke can you rebase to master and fix the merge conflict?

Please try my new cache server at mtr.devops.telekom.de/ai/cache-server:latest with the environment variables MAXIMUM_CACHE_STALENESS=P0DT0H3M30S DEFAULT_CACHE_STALENESS=P0DT0H1M31S

@TobiasGoerke
Contributor Author

TobiasGoerke commented Sep 13, 2022

> @TobiasGoerke can you rebase to master and fix the merge conflict?
>
> Please try my new cache server at mtr.devops.telekom.de/ai/cache-server:latest with the environment variables MAXIMUM_CACHE_STALENESS=P0DT0H3M30S DEFAULT_CACHE_STALENESS=P0DT0H1M31S

I've tested your changes and they work well for me. The cache is invalidated after a couple of minutes.

To reproduce and install this PR locally, I've pushed the ml-pipeline images to Docker Hub.
Simply add the following to your top-level kustomization:

images:
  - name: gcr.io/ml-pipeline/cache-server
    newName: mtr.devops.telekom.de/ai/cache-server
    newTag: latest
  - name: gcr.io/ml-pipeline/frontend
    newName: tobiasgoerke/ml-pipeline-frontend
  - name: gcr.io/ml-pipeline/api-server
    newName: tobiasgoerke/ml-pipeline-api-server

patchesStrategicMerge:
- |
  apiVersion: apps/v1
  kind: Deployment
  metadata:
    name: cache-server
    namespace: kubeflow
  spec:
    template:
      spec:
        containers:
        - name: server
          env:
          - name: MAXIMUM_CACHE_STALENESS
            value: P0DT0H3M30S
          - name: DEFAULT_CACHE_STALENESS
            value: P0DT0H1M31S

@juliusvonkohout
Member

/test kubeflow-pipeline-frontend-test

@TobiasGoerke TobiasGoerke changed the title [WIP] feat(frontend): caching may now be disabled when starting pipeline runs feat(frontend): caching may now be disabled when starting pipeline runs Sep 14, 2022
@juliusvonkohout
Member

API changes are to be discussed with @Linchin

@juliusvonkohout
Member

As discussed in today's KFP meeting, I will break the cache-server changes out into a separate, very small PR so that @chensun can review and merge it independently.

@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign linchin, zijianjoy for approval by writing /assign @linchin @zijianjoy in a comment. For more information see: The Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

…arting pipeline runs.

Added a new flag 'disable_cache' to run and job protos in API backend.
Backend handles this flag and sets caching annotations accordingly.
Frontend enables the user to set the flag by offering a switch button.
Squashed previous commits