
Make Partitioned Dataset Lazy Saving example more robust #3052

Open
cverluiseQB opened this issue Sep 20, 2023 · 5 comments
Labels
Component: Documentation 📄 Issue/PR for markdown and API documentation

Comments

@cverluiseQB

cverluiseQB commented Sep 20, 2023

Partitioned Dataset Lazy Saving

Problem

When using partitioned datasets with lazy saving in Kedro, following the current documentation example, the key-value mapping breaks: the same value is saved for every key.

Expected Behavior

Without lazy saving (i.e. without wrapping the values in lambdas), everything works as expected, with keys and values correctly associated.

Example


Example dataset

! mkdir ../data/01_raw/bug
! parallel 'touch ../data/01_raw/bug/{}.csv' ::: a b c d e f g h i j
! parallel 'echo "{}" >>  ../data/01_raw/bug/{}.csv' ::: a b c d e f g h i j
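
If GNU parallel is not available, a pure-Python equivalent (a sketch assuming the same file names and contents) creates the same fixture:

from pathlib import Path

data_dir = Path("../data/01_raw/bug")
data_dir.mkdir(parents=True, exist_ok=True)
for name in "abcdefghij":
    # each file contains its own name, matching the parallel/echo commands above
    (data_dir / f"{name}.csv").write_text(f"{name}\n")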

Example catalog

input_files:
    type: PartitionedDataset
    path: ../data/01_raw/bug/
    dataset: pandas.CSVDataSet
    filename_suffix: .csv
    overwrite: True
    
output_files:
    type: PartitionedDataset
    path: ../data/02_intermediate/bug/
    dataset: pandas.CSVDataSet
    filename_suffix: .csv
    overwrite: True
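
Assumption for the snippets below: in the notebook, the catalog YAML above is kept in a plain string named `catalog`, so it can be parsed with yaml.safe_load:

catalog = """
input_files:
    type: PartitionedDataset
    path: ../data/01_raw/bug/
    dataset: pandas.CSVDataSet
    filename_suffix: .csv
    overwrite: True

output_files:
    type: PartitionedDataset
    path: ../data/02_intermediate/bug/
    dataset: pandas.CSVDataSet
    filename_suffix: .csv
    overwrite: True
"""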

Example node

def copy_paste(input_loader):
    df = input_loader()
    return df

def copy_paste_node(input_files):
    return {k: lambda: copy_paste(v) for k, v in input_files.items()}
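
For comparison, the eager variant mentioned under Expected Behavior (a sketch, not part of the docs example) returns the loaded dataframes directly, so there is no lambda and no scoping issue:

def copy_paste_node_eager(input_files):
    # Load every partition immediately; keys and values stay correctly paired,
    # at the cost of holding all partitions in memory at once.
    return {k: copy_paste(v) for k, v in input_files.items()}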

Run node

import yaml

from kedro.io import DataCatalog

# Create a data catalog from the configuration (the `catalog` YAML string above)
conf_catalog = yaml.safe_load(catalog)
data_catalog = DataCatalog.from_config(conf_catalog)

# data input
input_files = data_catalog.load("input_files")
# function call
output_files = copy_paste_node(input_files)
# data output
data_catalog.save("output_files", output_files)

Observe bug

from pathlib import Path

filepaths = Path("../data/02_intermediate/bug/").glob("*.csv")
for filepath in filepaths:
    print(filepath.name, filepath.read_text(), sep=": ")

# a.csv: j -> Expected a.csv: a
# c.csv: j -> Expected c.csv: c
# b.csv: j -> Expected b.csv: b
# f.csv: j -> ...
# g.csv: j
# e.csv: j
# d.csv: j
# i.csv: j
# h.csv: j
# j.csv: j

Fix

Bind the loop variable as a default argument of the lambda so that each lambda captures the current value, which addresses the lambda scoping (late-binding) issue.

In our example, this means the following:

def copy_paste_node(input_files):
    return {k: lambda v=v: copy_paste(v) for k, v in input_files.items()}

# `lambda:` becomes `lambda v=v:`
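
An alternative that avoids the default-argument idiom (a sketch, not from the issue) is to bind the loader with functools.partial while keeping the save itself lazy:

from functools import partial

def copy_paste_node(input_files):
    # partial() captures the current value of v when the dict is built,
    # while the actual load/copy is still deferred until save time.
    return {k: partial(copy_paste, v) for k, v in input_files.items()}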

Environment

  • Kedro Version: 0.18.3
  • Operating System: macOS

Resources

Slack discussion 🤗

@noklam
Contributor

noklam commented Sep 20, 2023

Thank you for raising this issue, it's very well written. Our team will have a look shortly.

@noklam
Contributor

noklam commented Sep 20, 2023

From my understanding, this is not a Kedro problem. It's how lambda variable scoping works. See this example:

In [7]: iterable = [lambda: print(x) for x in range(4)]
   ...: 
   ...: for i in iterable:
   ...:     i()
   ...: 
   ...: print("Assign the variable to lambda scope")
   ...: 
   ...: iterable = [lambda x=x : print(x) for x in range(4)]
   ...: 
   ...: for i in iterable:
   ...:     i()
   ...: 
   ...: 
3
3
3
3
Assign the variable to lambda scope
0
1
2
3

This Stack Overflow thread explains it better: https://crawler.algolia.com/admin/crawlers/189d20ee-337e-4498-8a4c-61238789942e/overview
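
Late binding is not specific to lambdas; a sketch with a regular def closure over the loop variable shows the same behaviour:

def make_printers():
    printers = []
    for x in range(4):
        def printer():
            # x is looked up when printer() is called, not when it is defined
            print(x)
        printers.append(printer)
    return printers

for p in make_printers():
    p()  # prints 3 four times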

@cverluiseQB
Author

In that case, I think this is mainly a matter of documenting the right approach in the docs. wdyt?

@noklam noklam added the Component: Documentation 📄 Issue/PR for markdown and API documentation label Sep 20, 2023
@noklam
Contributor

noklam commented Sep 20, 2023

I have marked this as a documentation effort.

I suggest that whoever picks up this ticket adds a note section warning about lambda scoping, checks whether we can improve our docs example, and maybe adds this to an FAQ.

@astrojuanlu astrojuanlu changed the title Partitioned Dataset Lazy Saving [bug/doc] Make Partitioned Dataset Lazy Saving example more robust Sep 20, 2023
@stichbury
Contributor

This is something to tackle as part of #2941
