
[backend] the cache server does not check whether the cached artifacts have been deleted #7939

Open
juliusvonkohout opened this issue Jun 24, 2022 · 27 comments

Comments

@juliusvonkohout
Member

Environment

Kubeflow 1.5.1

Steps to reproduce

The cache server wrongly states that a step is cached even if its artifacts have been deleted from S3. We propose #7938, which checks whether the folder actually still exists on S3. If the folder does not exist, we no longer wrongly report the step as cached; instead we delete the cachedb entry and let the pipeline step run again.
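
For illustration, the kind of check #7938 proposes could look roughly like the sketch below. This is not the PR's actual code; the MinIO endpoint, credentials, bucket, and prefix are placeholders.

package main

import (
    "context"
    "log"

    "github.com/minio/minio-go/v7"
    "github.com/minio/minio-go/v7/pkg/credentials"
)

// artifactsExist reports whether any object still exists under the given prefix.
func artifactsExist(ctx context.Context, client *minio.Client, bucket, prefix string) (bool, error) {
    opts := minio.ListObjectsOptions{Prefix: prefix, Recursive: true}
    for obj := range client.ListObjects(ctx, bucket, opts) {
        if obj.Err != nil {
            return false, obj.Err
        }
        return true, nil // at least one object is still present
    }
    return false, nil // the prefix is empty: the cached outputs are gone
}

func main() {
    // Endpoint and credentials are illustrative defaults, not the cache server's actual config handling.
    client, err := minio.New("minio-service.kubeflow:9000", &minio.Options{
        Creds:  credentials.NewStaticV4("minio", "minio123", ""),
        Secure: false,
    })
    if err != nil {
        log.Fatal(err)
    }
    ok, err := artifactsExist(context.Background(), client, "mlpipeline", "artifacts/my-pipeline/my-step/")
    if err != nil {
        log.Fatal(err)
    }
    if !ok {
        // This is where the cache server would delete the stale cachedb entry
        // and let the step run again instead of reporting a cache hit.
        log.Println("cached artifacts are gone; cache entry should be invalidated")
    }
}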


Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.

@chensun
Member

chensun commented Jul 7, 2022

I'm not entirely sure about v1 behavior, but in v2, we don't make any assumption on whether an artifact has content in its path. For instance, Metrics artifact has nothing in its cloud storage path but only metadata in the MLMD database. So it's by design that we don't check the "existence" of artifacts with or without caching.

You could disable caching when submitting a pipeline run though, the SDK client has a parameter for that.

@juliusvonkohout
Member Author

I'm not entirely sure about v1 behavior, but in v2, we don't make any assumption on whether an artifact has content in its path. For instance, Metrics artifact has nothing in its cloud storage path but only metadata in the MLMD database. So it's by design that we don't check the "existence" of artifacts with or without caching.

You could disable caching when submitting a pipeline run though, the SDK client has a parameter for that.

This is a fix for v1 behavior only; I do not know about v2 behavior. The pipeline just breaks if the S3 content is deleted, so somehow that must be fixed. Do you have another proposal?

@chensun
Member

chensun commented Jul 14, 2022

No, I think it's by design. If users delete artifacts, shouldn't that be considered a user error?
The solution is to run again with the cache option disabled.

@chensun chensun closed this as completed Jul 14, 2022
@MatthiasCarnein

MatthiasCarnein commented Jul 15, 2022

I believe that @juliusvonkohout raised a very important issue here. In many scenarios users are legally required to clean up old records, and keeping the cachedb in sync is definitely a pain point.

In the past, the kfp project has also not considered this a user error and it was even recommended as best practice for production grade deployments by @Bobgy (#6204 (comment)).

@juliusvonkohout
Member Author

I believe that @juliusvonkohout raised a very important issue here. In many scenarios users are legally required to cleanup old records and keeping the cachedb in sync is definitely a pain point.

In the past, the kfp project has also not considered this a user error and it was even recommended as best practice for production grade deployments by @Bobgy (#6204 (comment)).

Yes, @chensun please reopen, it is clearly a bug.

@chensun chensun reopened this Jul 20, 2022
@juliusvonkohout
Member Author

@chensun as a follow-up to the meeting, I do not see a global setting here: https://www.kubeflow.org/docs/components/pipelines/overview/caching/

@chensun
Member

chensun commented Jul 20, 2022

We discussed this in today's KFP community meeting, and I want to summarize it here:

In the past, the kfp project has also not considered this a user error and it was even recommended as best practice for production grade deployments by @Bobgy (#6204 (comment)).

Yes, @chensun please reopen, it is clearly a bug.

First of all, what you quoted from @Bobgy (#6204 (comment)) doesn't conflict with what I said above.

I didn't mean that the act of deleting artifacts from cloud storage is itself a user error. My reply is within the specific context of this issue: deleting the artifacts while relying on caching to continue working is a user error. It's like in C: you have a pointer that points to some allocated memory; freeing the memory is absolutely normal, but continuing to access the pointer would be a user error.

If users delete artifacts periodically, they should probably configure pipeline runs with a cache staleness setting (understood that it's different from a cache lifecycle, but I think it still solves this use case). Or they can disable caching for new runs after artifact deletion. From this point of view, this is not a bug.

I'm against the proposed fix because, as I mentioned, an artifact has a remote storage URI and metadata, and both are optional. We can't assume there's content at the URI, and I'm not sure we can assume the URI always exists even if users never write to it--it's rather an implementation detail that may vary across platforms. We could check and verify this, but I don't think the cache server should care about an external system at all.

I personally would prefer to improve the caching configuration story--if the existing caching setting is not that obvious or easy to use, maybe we can add a global config. Or, as @connor-mccarthy mentioned, maybe we provide a way for users to delete cache entries in the database--it could be as simple as a script to begin with, as long as it's well documented.

@MatthiasCarnein

I think the difference is that the user is not actively "relying on caching". For the user it just happens automatically when using the same component definition and arguments. Having this trigger automatically but then fail is confusing for users.

If I understand you correctly, you are suggesting that a user should manually disable caching when the corresponding artifact was deleted. But how would that work in practice? Does the user have to figure out the component definition and arguments that produced the artifact and note them down to disable caching whenever accidentally using the same definitions again in the future? It would be impossible to keep track of all items that the cache server still considers cached but are already deleted from storage (the cachedb is always growing and never cleaned after all).

I agree that there are other approaches to solve this problem. For example you could have an option somewhere in the UI to delete an artifact which would clean the artifact from S3, mlmd and the cachedb (but then again that's not supported by mlmd as far as I know).

But recommending to delete artifacts from S3 is incompatible with the current caching behaviour in my eyes.

@juliusvonkohout
Member Author

juliusvonkohout commented Jul 21, 2022

I personally would prefer to improve the caching configuration story--if the existing caching setting is not that obvious or easy to use, maybe we can add a global config. Or as @connor-mccarthy mentioned, maybe we provide a way for users to delete cache entires in the database--it could be as simple as a script to begin with as long as it's well documented.

We cannot expect users to set it for every pipeline themselves; that is too error-prone, as @MatthiasCarnein explains. We need a global expiration setting (maximum cache lifetime), for example an environment variable in the cache server that is respected by v1 and v2 pipelines. This setting could also enforce the deletion of old database entries according to the expiration setting, so we could reuse the parts of my PR that clean the database entries (cachedb, and maybe MLMD for v2). Then we can use a lifecycle policy on the object storage with the same timeframe. Users can then use a cache staleness up to the cache expiration duration, and it would be a user error if they manually delete artifacts before the lifecycle policy does.

Does this sound reasonable to you?
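
To make the object-storage side concrete, a bucket lifecycle with the same timeframe could be set up roughly as in the sketch below, assuming the MinIO Go client; the endpoint, credentials, bucket, prefix, and 30-day window are placeholders, and the same policy could just as well be configured directly on S3 or the storage provider.

package main

import (
    "context"
    "log"

    "github.com/minio/minio-go/v7"
    "github.com/minio/minio-go/v7/pkg/credentials"
    "github.com/minio/minio-go/v7/pkg/lifecycle"
)

func main() {
    client, err := minio.New("minio-service.kubeflow:9000", &minio.Options{
        Creds:  credentials.NewStaticV4("minio", "minio123", ""),
        Secure: false,
    })
    if err != nil {
        log.Fatal(err)
    }

    // Expire pipeline artifacts after 30 days, matching the proposed cache expiration window.
    config := &lifecycle.Configuration{
        Rules: []lifecycle.Rule{
            {
                ID:         "expire-pipeline-artifacts",
                Status:     "Enabled",
                RuleFilter: lifecycle.Filter{Prefix: "artifacts/"},
                Expiration: lifecycle.Expiration{Days: 30},
            },
        },
    }
    if err := client.SetBucketLifecycle(context.Background(), "mlpipeline", config); err != nil {
        log.Fatal(err)
    }
}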

@chensun
Member

chensun commented Jul 22, 2022

I think the difference is that the user is not actively "relying on caching". For the user it just happens automatically when using the same component definition and arguments. Having this trigger automatically but then fail is confusing for users.

That's a fair argument. On the other hand, though, I don't think it's too much to ask of users to understand that KFP has caching on by default. After all, I perceive KFP as an advanced developer tool. Users probably should understand the possible impacts of deleting artifacts.

If I understand you correctly, you are suggesting that a user should manually disable caching when the corresponding artifact was deleted. But how would that work in practice? Does the user have to figure out the component definition and arguments that produced the artifact and note them down to disable caching whenever accidentally using the same definitions again in the future?

In practice, how do users delete artifacts? I guess they don't cherry-pick specific artifacts to delete but rather delete artifacts in bulk under some root path. Then they could disable caching for all their pipelines, or a set of them. If they do delete artifacts in a precise, cherry-picking manner, they probably already track which pipelines and inputs produced those artifacts.

I agree that there are other approaches to solve this problem. For example you could have an option somewhere in the UI to delete an artifact which would clean the artifact from S3, mlmd and the cachedb (but then again that's not supported by mlmd as far as I know).

But recommending to delete artifacts from S3 is incompatible with the current caching behaviour in my eyes.

Indeed, if KFP provided a feature for users to delete artifacts, which is what Bobgy suggested in the other thread, then it should be responsible for cleaning up all the related items that should and can be deleted--cache entries definitely fall into this category.
But the current state is that KFP doesn't offer such a feature, and users are deleting artifacts on their own, which happens outside the KFP system. I don't think KFP should be required to respond to external actions.

In summary, I'm supportive of the following options:

  • Make artifact deletion a KFP feature, so that KFP does the full cleanup
  • Provide a script or API for users to delete cache entries, so they can easily do so after they delete artifacts on their own (a minimal script is sketched below).
  • Make a global caching config, so users can set a cache lifecycle/staleness that matches their artifact deletion policy or need.
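
For the second option, a standalone cleanup script could be as small as the sketch below, assuming direct MySQL access to the cachedb database; the DSN handling and the hypothetical delete-cache-entry command name are illustrative, while execution_caches and ExecutionCacheKey are the table and column names visible in the cache store code further down.

package main

import (
    "database/sql"
    "fmt"
    "log"
    "os"

    _ "github.com/go-sql-driver/mysql"
)

func main() {
    // DSN and cache key are passed in by the user; both are placeholders here.
    dsn := os.Getenv("CACHEDB_DSN") // e.g. "root:password@tcp(mysql.kubeflow:3306)/cachedb"
    if len(os.Args) < 2 {
        log.Fatal("usage: delete-cache-entry <ExecutionCacheKey>")
    }
    key := os.Args[1]

    db, err := sql.Open("mysql", dsn)
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    // Delete every cachedb row recorded for this cache key.
    res, err := db.Exec("DELETE FROM execution_caches WHERE ExecutionCacheKey = ?", key)
    if err != nil {
        log.Fatal(err)
    }
    n, _ := res.RowsAffected()
    fmt.Printf("deleted %d cache entries for key %s\n", n, key)
}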

@juliusvonkohout
Member Author

In summary, I'm supportive of the following options:

* Make artifact deleting a KFP feature so that KFP should do the full cleanup

* Provide a script or API for user to delete cache entry, so they can easily do so after they delete artifacts on their own.

* Make a global caching config, so users can set cache lifecycle/staleness that matches their artifact deletion policy or need.

Shall I implement the last item, or do you want to do it? It would just be an environment variable MAX_CACHE_STALENESS in the cache server, derived from the configmap pipeline-install-config, that uses the same syntax as the per-pipeline cache settings (e.g. P30D). Then in the cache server we modify

annotations[ExecutionKey] = executionHashKey
labels[CacheIDLabelKey] = ""
var maxCacheStalenessInSeconds int64 = -1
maxCacheStaleness, exists := annotations[MaxCacheStalenessKey]
if exists {
    maxCacheStalenessInSeconds = getMaxCacheStaleness(maxCacheStaleness)
}
var cachedExecution *model.ExecutionCache
cachedExecution, err = clientMgr.CacheStore().GetExecutionCache(executionHashKey, maxCacheStalenessInSeconds)

and

// Convert RFC3339 Duration(Eg. "P1DT30H4S") to int64 seconds.
func getMaxCacheStaleness(maxCacheStaleness string) int64 {
    var seconds int64 = -1
    if d, err := duration.Parse(maxCacheStaleness); err == nil {
        seconds = int64(d / time.Second)
    }
    return seconds
}

to read the environment variable instead of -1.

For example with

    if exists {
        maxCacheStalenessInSeconds = getMaxCacheStaleness(maxCacheStaleness)
    } else {
        // Fall back to the global default, e.g. MAX_CACHE_STALENESS from the pipeline-install-config configmap.
        maxCacheStalenessInSeconds = getMaxCacheStaleness(os.Getenv("MAX_CACHE_STALENESS"))
    }

Then in

func (s *ExecutionCacheStore) GetExecutionCache(executionCacheKey string, maxCacheStaleness int64) (*model.ExecutionCache, error) {
    if maxCacheStaleness == 0 {
        return nil, fmt.Errorf("MaxCacheStaleness=0, Cache is disabled.")
    }
    r, err := s.db.Table("execution_caches").Where("ExecutionCacheKey = ?", executionCacheKey).Rows()
    if err != nil {
        return nil, fmt.Errorf("Failed to get execution cache: %q", executionCacheKey)
    }
    defer r.Close()
    executionCaches, err := s.scanRows(r, maxCacheStaleness)

and

        if maxCacheStaleness == -1 || s.time.Now().UTC().Unix()-startedAtInSec <= podMaxCacheStaleness {
            executionCaches = append(executionCaches, &model.ExecutionCache{
                ID:                id,
                ExecutionCacheKey: executionCacheKey,
                ExecutionTemplate: executionTemplate,
                ExecutionOutput:   executionOutput,
                MaxCacheStaleness: maxCacheStaleness,
                StartedAtInSec:    startedAtInSec,
                EndedAtInSec:      endedAtInSec,
            })
        }
    }
    return executionCaches, nil
}

it will just be properly used. So we add three lines and an environment variable via the global configmap to satisfy most of the use cases.

If desired we could also add a CACHE_EXPIRATION variable and delete the database entry with

func (s *ExecutionCacheStore) DeleteExecutionCache(executionCacheID string) error {
    db := s.db.Delete(&model.ExecutionCache{}, "ID = ?", executionCacheID)
    if db.Error != nil {
        return db.Error
    }
    return nil
}
if it is too old.

For example after

func (s *ExecutionCacheStore) scanRows(rows *sql.Rows, podMaxCacheStaleness int64) ([]*model.ExecutionCache, error) {
    var executionCaches []*model.ExecutionCache
    for rows.Next() {
        var executionCacheKey, executionTemplate, executionOutput string
        var id, maxCacheStaleness, startedAtInSec, endedAtInSec int64
        err := rows.Scan(
            &id,
            &executionCacheKey,
            &executionTemplate,
            &executionOutput,
            &maxCacheStaleness,
            &startedAtInSec,
            &endedAtInSec)
        if err != nil {
            return executionCaches, nil
        }
        log.Println("Get id: " + strconv.FormatInt(id, 10))
        log.Println("Get template: " + executionTemplate)
        if maxCacheStaleness == -1 || s.time.Now().UTC().Unix()-startedAtInSec <= podMaxCacheStaleness {
            executionCaches = append(executionCaches, &model.ExecutionCache{
                ID:                id,
                ExecutionCacheKey: executionCacheKey,
                ExecutionTemplate: executionTemplate,
                ExecutionOutput:   executionOutput,
                MaxCacheStaleness: maxCacheStaleness,
                StartedAtInSec:    startedAtInSec,
                EndedAtInSec:      endedAtInSec,
            })
        }
we could add

        if maxCacheStaleness == -1 || s.time.Now().UTC().Unix()-startedAtInSec <= podMaxCacheStaleness {
            ...
        }
        // Additionally drop entries that are older than the global CACHE_EXPIRATION setting.
        if s.time.Now().UTC().Unix()-startedAtInSec > getMaxCacheStaleness(os.Getenv("CACHE_EXPIRATION")) {
            err = s.DeleteExecutionCache(strconv.FormatInt(id, 10))
        }

Maybe it would even be better to just periodically execute an SQL command that cleans all entries older than CACHE_EXPIRATION from the cachedb table.
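
That periodic variant could look roughly like the sketch below, again assuming direct access to the cachedb database. The DSN, the 24-hour interval, and reading CACHE_EXPIRATION as plain seconds are all simplifications (the real setting would reuse the RFC3339 duration parsing above), and StartedAtInSec assumes the columns keep the CamelCase names used in the store's Where clauses.

package main

import (
    "database/sql"
    "log"
    "os"
    "strconv"
    "time"

    _ "github.com/go-sql-driver/mysql"
)

func main() {
    db, err := sql.Open("mysql", os.Getenv("CACHEDB_DSN")) // placeholder DSN, e.g. "root:password@tcp(mysql.kubeflow:3306)/cachedb"
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    // CACHE_EXPIRATION is read as seconds here purely for brevity.
    expiration, err := strconv.ParseInt(os.Getenv("CACHE_EXPIRATION"), 10, 64)
    if err != nil {
        log.Fatal(err)
    }

    ticker := time.NewTicker(24 * time.Hour)
    defer ticker.Stop()
    for range ticker.C {
        // Remove every cachedb row whose run started before the expiration window.
        cutoff := time.Now().UTC().Unix() - expiration
        res, err := db.Exec("DELETE FROM execution_caches WHERE StartedAtInSec < ?", cutoff)
        if err != nil {
            log.Printf("cachedb cleanup failed: %v", err)
            continue
        }
        n, _ := res.RowsAffected()
        log.Printf("expired %d cache entries started before %d", n, cutoff)
    }
}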

@juliusvonkohout
Member Author

And is there documentation for V2 caching?

if isV2Pod(&pod) {
    // KFP v2 handles caching by its driver.
    log.Printf("This pod %s is created by KFP v2 pipelines.", pod.ObjectMeta.Name)
    return nil, nil
}

@TobiasGoerke
Contributor

TobiasGoerke commented Aug 4, 2022

Kubeflow definitely misses the ability to manage artifacts and I'm glad I am not the only one thinking so (also see #5783). I'd love to see invalidating the cache and deleting artifacts in the UI become a feature.

However, MLMD doesn't seem to support deleting artifacts anytime soon and it wouldn't help us managing Kubeflow's cache, anyways.

Thus, I've written a short PoC suggesting a way this feature could be implemented. Its integration should be self-explanatory:

![kubeflow_cache](https://user-images.githubusercontent.com/13769461/182807832-5fcd0b49-5b84-4172-9b0a-4904c190dace.gif)

Code and concept are still in a very early stage. For more info on how the backend manages the cache, see here.

I'm looking for feedback and discussions on this proposal and will gladly contribute to implementing it.

@chensun
Member

chensun commented Aug 4, 2022

Shall I implement the last item, or do you want to do it? It would just be an environment variable MAX_CACHE_STALENESS in the cache server, derived from the configmap pipeline-install-config, that uses the same syntax as the per-pipeline cache settings (e.g. P30D). [...]

@juliusvonkohout Thank you for the write-up. Given this is a nontrivial user interface change, I would suggest you write a design doc so that we can conduct a proper design review with our team and the community.

@chensun
Member

chensun commented Aug 4, 2022

Kubeflow definitely misses the ability to manage artifacts and I'm glad I am not the only one thinking so (also see #5783). I'd love to see invalidating the cache and deleting artifacts in the UI become a feature.

However, MLMD doesn't seem to support deleting artifacts anytime soon and it wouldn't help us managing Kubeflow's cache, anyways.

Thus, I've written a short PoC suggesting a way this feature could be implemented. Its integration should be self-explanatory:

![kubeflow_cache](https://user-images.githubusercontent.com/13769461/182807832-5fcd0b49-5b84-4172-9b0a-4904c190dace.gif)

Code and concept are still in a very early stage. For more info on how the backend manages the cache, see here.

I'm looking for feedback and discussions on this proposal and will gladly contribute to implementing it.

Thanks @TobiasGoerke, and the same suggestion: please write a design doc so that we can review it properly.

@TobiasGoerke
Contributor

Kubeflow definitely misses the ability to manage artifacts and I'm glad I am not the only one thinking so (also see #5783). I'd love to see invalidating the cache and deleting artifacts in the UI become a feature.
However, MLMD doesn't seem to support deleting artifacts anytime soon and it wouldn't help us managing Kubeflow's cache, anyways.
Thus, I've written a short PoC suggesting a way this feature could be implemented. Its integration should be self-explanatory:
![kubeflow_cache](https://user-images.githubusercontent.com/13769461/182807832-5fcd0b49-5b84-4172-9b0a-4904c190dace.gif)

Code and concept are still in a very early stage. For more info on how the backend manages the cache, see here.
I'm looking for feedback and discussions on this proposal and will gladly contribute to implementing it.

Thanks @TobiasGoerke, and the same suggestion: please write a design doc so that we can review it properly.

I've created a draft. What's the workflow for proposals? Where should I post it? Thanks.

@juliusvonkohout
Member Author

@TobiasGoerke just write to me or Diana Atanasova on the Kubeflow Slack. We have some experience with those drafts.

@IronPan
Member

IronPan commented Sep 27, 2022

In summary, I'm supportive of the following options:

* Make artifact deleting a KFP feature so that KFP should do the full cleanup

* Provide a script or API for user to delete cache entry, so they can easily do so after they delete artifacts on their own.

* Make a global caching config, so users can set cache lifecycle/staleness that matches their artifact deletion policy or need.

Is a job-level caching config an option? The global caching config might be inflexible if different pipelines' artifacts have different TTLs, especially if multiple users share the same environment.

@juliusvonkohout
Member Author

@IronPan

You can still set a per-pipeline and per-step cache staleness. Please check #8270 (comment) and the design document linked there.

@juliusvonkohout
Member Author

juliusvonkohout commented Oct 14, 2022

Sadly, we have to be agnostic of the backend until the new unified storage architecture from #7725 (comment) is there. So I got #8270 merged instead of #7938. V2 pipelines are not yet covered.


This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the lifecycle/stale The issue / pull request is stale, any activities remove this label. label Apr 29, 2024

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.

@AndersBennedsgaard

@juliusvonkohout could we have this opened again?

@juliusvonkohout
Member Author

@AndersBennedsgaard

/reopen


@juliusvonkohout: Reopened this issue.

In response to this:

@AndersBennedsgaard

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@google-oss-prow google-oss-prow bot reopened this Jun 10, 2024
@juliusvonkohout
Member Author

/lifecycle frozen

@google-oss-prow google-oss-prow bot added lifecycle/frozen and removed lifecycle/stale The issue / pull request is stale, any activities remove this label. labels Jun 10, 2024