
Make Partitioned Dataset Lazy Saving example more robust #3052

Open
cverluiseQB opened this issue Sep 20, 2023 · 5 comments
Labels
Component: Documentation 📄 Issue/PR for markdown and API documentation

Comments

@cverluiseQB

cverluiseQB commented Sep 20, 2023

Partitioned Dataset Lazy Saving

Problem

When using partitioned datasets with lazy saving in Kedro, following the current documentation example, the key-value mapping breaks: the same value is saved for every key.

Expected Behavior

Without lazy saving (i.e. without wrapping the values in lambdas), everything works as expected, with keys and values correctly associated.

Example


Example dataset

! mkdir ../data/01_raw/bug
! parallel 'touch ../data/01_raw/bug/{}.csv' ::: a b c d e f g h i j
! parallel 'echo "{}" >>  ../data/01_raw/bug/{}.csv' ::: a b c d e f g h i j
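
If GNU parallel is not available, a pure-Python equivalent (a sketch assuming the same file names and contents) creates the same fixture:

from pathlib import Path

data_dir = Path("../data/01_raw/bug")
data_dir.mkdir(parents=True, exist_ok=True)
for name in "abcdefghij":
    # each file contains its own name, matching the parallel/echo commands above
    (data_dir / f"{name}.csv").write_text(f"{name}\n")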

Example catalog

input_files:
    type: PartitionedDataset
    path: ../data/01_raw/bug/
    dataset: pandas.CSVDataSet
    filename_suffix: .csv
    overwrite: True
    
output_files:
    type: PartitionedDataset
    path: ../data/02_intermediate/bug/
    dataset: pandas.CSVDataSet
    filename_suffix: .csv
    overwrite: True
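
Assumption for the snippets below: in the notebook, the catalog YAML above is kept in a plain string named `catalog`, so it can be parsed with yaml.safe_load:

catalog = """
input_files:
    type: PartitionedDataset
    path: ../data/01_raw/bug/
    dataset: pandas.CSVDataSet
    filename_suffix: .csv
    overwrite: True

output_files:
    type: PartitionedDataset
    path: ../data/02_intermediate/bug/
    dataset: pandas.CSVDataSet
    filename_suffix: .csv
    overwrite: True
"""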

Example node

def copy_paste(input_loader):
    df = input_loader()
    return df

def copy_paste_node(input_files):
    return {k: lambda: copy_paste(v) for k, v in input_files.items()}
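
For comparison, the eager variant mentioned under Expected Behavior (a sketch, not part of the docs example) returns the loaded dataframes directly, so there is no lambda and no scoping issue:

def copy_paste_node_eager(input_files):
    # Load every partition immediately; keys and values stay correctly paired,
    # at the cost of holding all partitions in memory at once.
    return {k: copy_paste(v) for k, v in input_files.items()}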

Run node

import yaml

from kedro.io import DataCatalog

# Create a data catalog from the configuration (the `catalog` YAML string above)
conf_catalog = yaml.safe_load(catalog)
data_catalog = DataCatalog.from_config(conf_catalog)

# data input
input_files = data_catalog.load("input_files")
# function call
output_files = copy_paste_node(input_files)
# data output
data_catalog.save("output_files", output_files)

Observe bug

from pathlib import Path

filepaths = Path("../data/02_intermediate/bug/").glob("*.csv")
for filepath in filepaths:
    print(filepath.name, filepath.read_text(), sep=": ")

# a.csv: j -> Expected a.csv: a
# c.csv: j -> Expected c.csv: c
# b.csv: j -> Expected b.csv: b
# f.csv: j -> ...
# g.csv: j
# e.csv: j
# d.csv: j
# i.csv: j
# h.csv: j
# j.csv: j

Fix

Bind the loop variable as a default argument of the lambda so that each lambda captures the current value, which addresses the lambda scoping (late-binding) issue.

In our example, this means the following:

def copy_paste_node(input_files):
    return {k: lambda v=v: copy_paste(v) for k, v in input_files.items()}

# `lambda:` becomes `lambda v=v:`
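
An alternative that avoids the default-argument idiom (a sketch, not from the issue) is to bind the loader with functools.partial while keeping the save itself lazy:

from functools import partial

def copy_paste_node(input_files):
    # partial() captures the current value of v when the dict is built,
    # while the actual load/copy is still deferred until save time.
    return {k: partial(copy_paste, v) for k, v in input_files.items()}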

Environment

  • Kedro Version: 0.18.3
  • Operating System: macOS

Resources

Slack discussion 🤗

@noklam
Contributor

noklam commented Sep 20, 2023

Thank you for raising this issue, it's very well written. Our team will have a look shortly.

@noklam
Contributor

noklam commented Sep 20, 2023

From my understanding, this is not a Kedro problem. It's how lambda variable scoping works. See this example:

In [7]: iterable = [lambda: print(x) for x in range(4)]
   ...: 
   ...: for i in iterable:
   ...:     i()
   ...: 
   ...: print("Assign the variable to lambda scope")
   ...: 
   ...: iterable = [lambda x=x : print(x) for x in range(4)]
   ...: 
   ...: for i in iterable:
   ...:     i()
   ...: 
   ...: 
3
3
3
3
Assign the variable to lambda scope
0
1
2
3

This Stack Overflow thread explains it better: https://crawler.algolia.com/admin/crawlers/189d20ee-337e-4498-8a4c-61238789942e/overview
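
Late binding is not specific to lambdas; a sketch with a regular def closure over the loop variable shows the same behaviour:

def make_printers():
    printers = []
    for x in range(4):
        def printer():
            # x is looked up when printer() is called, not when it is defined
            print(x)
        printers.append(printer)
    return printers

for p in make_printers():
    p()  # prints 3 four times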

@cverluiseQB
Author

In that case, I think this is mainly a matter of documenting the right approach in the docs. wdyt?

@noklam noklam added the Component: Documentation 📄 Issue/PR for markdown and API documentation label Sep 20, 2023
@noklam
Contributor

noklam commented Sep 20, 2023

I have marked this as a documentation effort.

I suggest that whoever picks up this ticket adds a note section warning about lambda scoping, checks whether we can improve our docs example, and maybe adds this to an FAQ.

@astrojuanlu astrojuanlu changed the title Partitioned Dataset Lazy Saving [bug/doc] Make Partitioned Dataset Lazy Saving example more robust Sep 20, 2023
@stichbury
Contributor

This is something to tackle as part of #2941
