
[KED-2744] Memory Leakage - Unexpected Caching behaviour with CacheDataset #819

Open
noklam opened this issue Jul 3, 2021 · 20 comments
Labels
Issue: Bug Report 🐞 Bug that needs to be fixed

Comments

@noklam
Contributor

noklam commented Jul 3, 2021

Description

With a pipeline like the one below, I expected that once a node finishes executing, its inputs would be released to reduce the memory footprint. I found this is not the case, and it causes a MemoryError in my pipeline.
A -> B # Step1
B -> C # Step2
C -> D # Step3

For example, as soon as step 1 is finished, there is no reason to keep A in memory any more.
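The expected behaviour can be sketched with a toy release loop (this is illustrative reference counting, not Kedro's actual runner): each dataset is dropped as soon as no remaining node needs it as an input.

```python
from collections import Counter

# the three steps above, expressed as (inputs, outputs) per node
nodes = [({"A"}, {"B"}), ({"B"}, {"C"}), ({"C"}, {"D"})]

# how many not-yet-run nodes still need each dataset as an input
remaining_uses = Counter(ds for inputs, _ in nodes for ds in inputs)

in_memory = {"A"}  # the pipeline input is loaded up front
for inputs, outputs in nodes:
    in_memory |= outputs            # running the node produces its outputs
    for ds in inputs:
        remaining_uses[ds] -= 1
        if remaining_uses[ds] == 0:
            in_memory.discard(ds)   # nothing downstream needs it any more
```

After step 1, "A" is already discarded; by the end only the pipeline output "D" remains in memory.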

Context

Kedro keeps unnecessary variables in memory. I was originally writing some notes to dive deep into Kedro's dataset management and found that the behaviour is not what I would expect.

Steps to Reproduce

https://noklam.github.io/blog/posts/2021-07-02-kedro-datacatalog.html

Expected Result

A non-cached dataset should exist in memory only while its node runs. A CachedDataSet should be released as soon as it is no longer needed.

Actual Result

-- If you received an error, place it here.
-- Separate them if you have more than one.

Your Environment

Include as many relevant details about the environment in which you experienced the bug:

  • Kedro version used (pip show kedro or kedro -V):
  • Python version used (python -V):
  • Operating system and version:
@noklam noklam added the Issue: Bug Report 🐞 Bug that needs to be fixed label Jul 3, 2021
@merelcht
Member

merelcht commented Jul 5, 2021

Hi @noklam, thanks for flagging this and writing up such a thorough explanation in your blog post 👏 😄
I will get to the bottom of why it's been implemented in this way and keep you up to date about what I find.

@merelcht
Member

merelcht commented Jul 6, 2021

Hi @noklam, I've found that this is not actually a bug; it is explicitly designed this way.

The main reason is that pipeline inputs and outputs are special datasets: they can be used for debugging after a pipeline run. So the release functionality was explicitly developed for the intermediate datasets only.

On top of that, input parameters are handled as MemoryDataSets, so if the check for data_set not in pipeline.inputs() didn't exist, parameters would be released from memory as well. The code was written a couple of years ago, and we haven't investigated whether releasing parameters would be a problem, but regardless, pipeline inputs and outputs should be treated differently according to our philosophy.
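The described check might look something like this simplified sketch (the names and the FakeCatalog stub are illustrative, not Kedro's actual code): intermediates are freed after use, while pipeline inputs and outputs survive the run.

```python
class FakeCatalog:
    """Stand-in for a data catalog that tracks release calls."""
    def __init__(self):
        self.released = []
    def release(self, name):
        self.released.append(name)

def maybe_release(data_set, pipeline_inputs, pipeline_outputs, catalog):
    # only intermediate datasets are freed; inputs/outputs are kept
    # so they remain available for debugging after the run
    if data_set not in pipeline_inputs and data_set not in pipeline_outputs:
        catalog.release(data_set)

catalog = FakeCatalog()
for ds in ["A", "B", "C", "D"]:
    maybe_release(ds, pipeline_inputs={"A"}, pipeline_outputs={"D"},
                  catalog=catalog)
# only the intermediates B and C are released
```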

@noklam
Contributor Author

noklam commented Jul 6, 2021

@MerelTheisenQB

Thanks for the explanation. If it is only for debugging purposes, shouldn't there be a flag to enable this behaviour rather than making it the default? Maybe I am missing something here, but I don't see why parameters need to be treated specially. It should be okay to release them as long as they are not needed by the remaining nodes. A function should only depend on its inputs.

In addition, CachedDataSet is backed by a MemoryDataSet, which means CachedDataSet data will stay in memory for the whole run. The purpose of caching is to reduce I/O, instead of writing to disk in one node and then immediately reading the same data back from disk. This is a big problem for large pipelines.

This is a common pattern for my training pipeline.

  1. preprocess raw data -> processed data
  2. processed data -> train/test split
  3. processed data concatenated with some metadata or prediction labels for analysis

I'll definitely use CachedDataSet for these functions, because I don't want to write 10 GB of data and then read it back off disk (it could take minutes).

Thoughts?

@idanov
Member

idanov commented Jul 6, 2021

@noklam Thanks for flagging this. This functionality was implemented long ago, as @MerelTheisenQB pointed out, and the main purpose is to enable interactive workflows more generally (and, as an extension of that, debugging). The idea is that you could run a pipeline multiple times with the same catalog instance while experimenting with different pipelines; if you start off with MemoryDataSet inputs that are released after the run, that makes for a very painful experience.

Most of the time this shouldn't cause problems, since one would normally use MemoryDataSet as a pipeline input only in interactive workflows. However, you have correctly identified that CachedDataSet can quite often be used for inputs in non-interactive workflows, and the desired behaviour would be to release the data once it is no longer needed.

For pipeline outputs unconsumed by anything after the run, there is no reason to define them as MemoryDataSet unless you work in interactive sessions (notebooks). Therefore the behaviour there is as expected.

CachedDataSet was not developed as a core Kedro component; it was an internal contribution, so we may not have considered the issue you point out. That means it's probably a good idea for us to revisit this and change the behaviour as follows:

  • Both MemoryDataSet and CachedDataSet should be released when used as intermediate datasets (Kedro already behaves like this to my knowledge) ✅
  • MemoryDataSet's used as pipeline inputs or outputs should not be released after runs to support interactive workflows (Kedro already behaves like this as claimed in this GitHub issue) ✅
  • CachedDataSet should be released after being used as pipeline input, but not as pipeline output (this is where Kedro fails user expectation) ❌
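The three bullets above could be condensed into a small decision function (a hypothetical sketch of the proposed policy, not Kedro's API):

```python
def should_release(dataset_type, is_pipeline_input, is_pipeline_output):
    """Decide whether a dataset can be freed after its last use."""
    if not is_pipeline_input and not is_pipeline_output:
        return True   # intermediates are always released
    if dataset_type == "MemoryDataSet":
        return False  # keep inputs/outputs for interactive workflows
    if dataset_type == "CachedDataSet":
        # the proposed change: release cached pipeline inputs,
        # but keep cached pipeline outputs
        return is_pipeline_input and not is_pipeline_output
    return False
```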

Would such behaviour suit your needs @noklam ?

@noklam
Contributor Author

noklam commented Jul 6, 2021

@idanov Thanks, I rarely use Kedro in interactive mode, and I can now see why this was implemented this way.

Yes, I think this matches my expectation. Any datasets that are not inputs to the remaining nodes should be released.

@datajoely
Contributor

Created ticket on backlog

@merelcht merelcht changed the title Memory Leakage - Unexpected Caching behavior with CacheDataSet [KED-2744] Memory Leakage - Unexpected Caching behaviour with CacheDataSet Jul 7, 2021
@acnazarejr

acnazarejr commented Jul 9, 2021

I am facing the same problem; however, my dataset is a pandas.ParquetDataSet. Even after the node has finished and the dataframe has been saved to disk, the data is still in memory. The memory is only released after the whole pipeline has finished executing.

Maybe I missed something, but should I configure something in Kedro so that it doesn't keep the data in memory?

@datajoely
Contributor

@acnazarejr - to confirm, is this also using the CachedDataSet? A persisted write to Parquet should discard the in-memory copy, which is then garbage collected by Python.

@acnazarejr

@datajoely, sorry for the delayed response. I'm not using the CachedDataSet.

@datajoely
Contributor

Hi @acnazarejr, I don't think it should be behaving this way. How do you know the memory is not being released?

@stale

stale bot commented Sep 12, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Sep 12, 2021
@stale stale bot closed this as completed Sep 19, 2021
@datajoely datajoely reopened this Oct 6, 2021
@stale stale bot removed the stale label Oct 6, 2021
@stale

stale bot commented Dec 5, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Dec 5, 2021
@stale stale bot closed this as completed Dec 12, 2021
@merelcht merelcht reopened this Mar 7, 2022
@merelcht merelcht removed the stale label Mar 7, 2022
@merelcht merelcht added this to the Redesign Catalog and Datasets milestone Feb 6, 2023
@merelcht
Member

This ticket hasn't had any recent activity, so I'm closing it.

@merelcht merelcht closed this as not planned Won't fix, can't repro, duplicate, stale Dec 13, 2023
@merelcht merelcht removed this from the Redesign the API for IO (catalog) milestone Feb 2, 2024
@noklam noklam reopened this Mar 21, 2024
@noklam noklam changed the title [KED-2744] Memory Leakage - Unexpected Caching behaviour with CacheDataSet [KED-2744] Memory Leakage - Unexpected Caching behaviour with CacheDataset Mar 21, 2024
@noklam
Contributor Author

noklam commented Mar 21, 2024

Reopening as this issue still exists, although how widely CachedDataset is used is unknown. This is partly because it was an internal contribution, and no docs have ever been written about it other than the API docs (#3616).

E.g. the idea is that you could run a pipeline multiple times with the same catalog instance in the process of experimenting with different pipelines and if you start off with a MemoryDataSet inputs which are released after the run, that will make for a very painful experience.

This is a very interesting point and I would love to revisit it at some point. It seems that years ago Kedro focused more on the interactive flow, with the idea of re-running a pipeline with the same DataCatalog, which is now pretty much blocked by KedroSession unless users use Runner and DataCatalog directly.

  • Both MemoryDataSet and CachedDataSet should be released when used as intermediate datasets (Kedro already behaves like this to my knowledge) ✅
  • MemoryDataSet's used as pipeline inputs or outputs should not be released after runs to support interactive workflows (Kedro already behaves like this as claimed in this GitHub issue) ✅
  • CachedDataSet should be released after being used as pipeline input, but not as pipeline output (this is where Kedro fails user expectation) ❌

The last point is the real problem here. We may not need to fix this issue immediately, but I think it's important if we think about re-running pipelines in a notebook.

@datajoely
Contributor

Could I also ask - why not make MemoryDatasets cache by default?

@noklam
Contributor Author

noklam commented Mar 21, 2024

@datajoely Can you clarify a bit? MemoryDataset does "cache" in the sense that it is loaded only once when passed between nodes. CachedDataset is slightly different because it is a wrapper dataset; under the hood it uses a MemoryDataset as its cache.
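The wrapper idea can be illustrated with a toy sketch (all class names here are hypothetical stand-ins, not Kedro's actual CachedDataset implementation): a slower persisted dataset is wrapped so that repeat loads are served from memory.

```python
class MemoryCacheSketch:
    """Plays the role of the in-memory MemoryDataset cache."""
    def __init__(self):
        self._data, self._filled = None, False
    def exists(self):
        return self._filled
    def save(self, data):
        self._data, self._filled = data, True
    def load(self):
        return self._data

class CachedDatasetSketch:
    """Wraps a slower dataset and serves repeat loads from memory."""
    def __init__(self, wrapped):
        self._wrapped = wrapped
        self._cache = MemoryCacheSketch()
    def save(self, data):
        self._cache.save(data)    # keep a copy in memory
        self._wrapped.save(data)  # and persist it
    def load(self):
        if self._cache.exists():
            return self._cache.load()  # skip the expensive load
        data = self._wrapped.load()
        self._cache.save(data)
        return data

class SlowDiskSketch:
    """Stand-in for a dataset that reads from disk; counts real loads."""
    def __init__(self):
        self.data, self.loads = None, 0
    def save(self, data):
        self.data = data
    def load(self):
        self.loads += 1
        return self.data

disk = SlowDiskSketch()
ds = CachedDatasetSketch(disk)
ds.save([1, 2, 3])
first, second = ds.load(), ds.load()
# both loads are served from the in-memory cache; disk.loads stays 0
```

This also makes the memory issue in this thread visible: the in-memory copy lives as long as the wrapper does, unless something explicitly releases it.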

@datajoely
Contributor

I guess my question is - if we can, why not make regular datasets cache by default?

@noklam
Contributor Author

noklam commented Mar 21, 2024

@datajoely I think it's because there is a 1% edge case that is not solved. I'd very much welcome this being changed.

df.to_csv("my_csv.csv")          # one node saves the dataframe
df2 = pd.read_csv("my_csv.csv")  # the next node re-loads it from disk

In most cases df and df2 will be identical; in badly typed cases they differ. Saving and re-loading each time is closer to how an orchestrator runs, I guess. This is very bad, because depending on where you start running your pipeline, you can get different outputs.
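The typing edge case can be shown with only the standard library (a minimal sketch using the csv module rather than pandas): values round-tripped through a text format come back as strings, so the re-loaded data is not guaranteed to equal what was saved.

```python
import csv
import io

# write an int and a float through a CSV "file" in memory
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["id", "value"])
writer.writerow([1, 2.5])

# read the same data back
buf.seek(0)
rows = list(csv.reader(buf))
# everything comes back as strings: the original types are lost,
# so cached-in-memory data and re-loaded-from-disk data can differ
```

pandas applies type inference on read, which usually recovers the types, but not always (e.g. leading zeros, mixed columns), which is the "badly typed" case above.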

@datajoely
Contributor

Got it! They say cache invalidation is the 2nd hardest problem for a reason!
