Dynamically create catalog entries #138

Closed
roumail opened this issue Oct 23, 2019 · 4 comments
Labels
Issue: Feature Request New feature or improvement to existing feature

Comments

roumail commented Oct 23, 2019

Description

I have two Python dictionaries that I'm saving locally using kedro.io.PickleLocalDataSet.

These dictionaries are created using the snippet below:

import pickle
import pandas as pd

df = pd.DataFrame({'num_legs': [2, 4, 8, 0],
                   'num_wings': [2, 0, 0, 0],
                   'num_specimen_seen': [10, 2, 1, 8]},
                   index=['falcon', 'dog', 'spider', 'fish'])

# Two dictionaries of DataFrames; the number of keys differs between them.
d1 = {
    'name': df.sample(2),
    'phone': df.sample(2),
}
d2 = {'hello': df}

with open("d1", "wb") as h:
    pickle.dump(d1, h)

with open("d2", "wb") as h:
    pickle.dump(d2, h)

Therefore, my catalog.yml looks like this:

d1:
  type: kedro.io.PickleLocalDataSet
  filepath: ./d1

d2:
  type: kedro.io.PickleLocalDataSet
  filepath: ./d2

I wish there were an easy way to "explode" the two dictionaries, d1 and d2, into CSVLocalDataSet entries:

name:
  type: kedro.io.CSVLocalDataSet
  filepath: ./name.csv

phone:
  type: kedro.io.CSVLocalDataSet
  filepath: ./phone.csv

hello:
  type: kedro.io.CSVLocalDataSet
  filepath: ./hello.csv

I first tried to do this by creating a node that reads a dictionary and outputs a list of dataframes, using a small function like the following:

def dict2df_list(d):
    # Return the DataFrames stored as the dictionary's values.
    return list(d.values())

However, I get stuck when trying to specify the config.yml for the above node, because I don't know up front how many dataframes will be generated. As seen in the example, d1 has two dataframes as values while d2 has only one.
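
For illustration, this is roughly what the node definition would have to look like (expressed with the code API for brevity; a minimal sketch using the output names from this example). The outputs have to be enumerated up front:

from kedro.pipeline import node

# Only possible when the keys of d1 are known in advance:
split_d1 = node(dict2df_list, inputs="d1", outputs=["name", "phone"])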

It should be clear already, but I've been using the YAML-file-based way of declaring pipelines, nodes, config, etc. I understand that the code API is equivalent, but I'm not as familiar with that approach for declaring pipelines.

Context

There are many times when we have a node that creates a list of outputs. We can't always pre-specify how many outputs will be generated. Therefore, some documentation around such use cases would be really helpful.
I'm not exactly sure how to proceed here.

Possible Alternatives

When I encountered a similar situation in the past, I resolved the dilemma by implementing a separate node for each of the twenty cases I had. Without a looping construct there was, of course, a lot of code duplication, but at least it worked. I cannot use a similar approach in this case since I don't know the total number of cases up front.
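
For what it's worth, when the cases are known up front, a loop over the code API avoids that duplication. The sketch below is hypothetical (the extract helper and the cases mapping are made up for this example):

from kedro.pipeline import Pipeline, node

def extract(key):
    # Build a node function that pulls a single dataframe out of a pickled dict.
    def _extract(d):
        return d[key]
    return _extract

# Hypothetical mapping of source dataset -> keys; it must be known in advance.
cases = {"d1": ["name", "phone"], "d2": ["hello"]}

pipeline = Pipeline([
    node(extract(key), inputs=source, outputs=key, name="extract_" + key)
    for source, keys in cases.items()
    for key in keys
])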

roumail added the Issue: Feature Request label on Oct 23, 2019

DmitriiDeriabinQB commented Oct 23, 2019

@roumail Thank you for your feedback. Kedro does not support this feature out of the box at the moment. You can programmatically modify the DataCatalog and add new datasets to it after instantiating one from YAML; however, you would also need to dynamically change a) the node definition that produces those datasets, and b) all nodes that consume them, which leads to a somewhat clunky solution.
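
To make that concrete, a rough sketch of the programmatic route could look like the following (not an official recipe; it assumes the kedro.io classes and file paths from the catalog above):

from kedro.io import CSVLocalDataSet, DataCatalog, PickleLocalDataSet

# Catalog equivalent to the YAML entries above.
catalog = DataCatalog({
    "d1": PickleLocalDataSet(filepath="./d1"),
    "d2": PickleLocalDataSet(filepath="./d2"),
})

# "Explode" each pickled dictionary into one CSV dataset per key.
for source in ("d1", "d2"):
    for key in catalog.load(source):
        catalog.add(key, CSVLocalDataSet(filepath="./" + key + ".csv"))

The node and pipeline definitions that consume "name", "phone" and "hello" would still need to be generated dynamically, which is where it gets clunky.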

If this is critical for your use case, we would need more information from you about your particular need and why it is not feasible to achieve the same thing with a single dataset or a constant set of them.


roumail commented Oct 24, 2019

Hi @DmitriiDeriabinQB, thank you for your response. For my particular use case, I was able to fix it in the end by sticking to the YAML API; I had previously been trying to mix the two, which led to a clunky implementation and confusion.

I guess I would close with a request for more information around the code vs. YAML API. For example: 1) can we mix the YAML and code APIs, and 2) is doing so even a good idea from a design perspective? Based on your last comment, it seems we should stick with one of the APIs and not mix the two.

roumail closed this as completed on Oct 24, 2019
roumail reopened this on Oct 24, 2019
DmitriiDeriabinQB commented

This is a good question to ask :)

Generally, I would say the recommendation is to avoid mixing the code and YAML APIs. Most of the time it just boils down to not doing in code something that YAML already supports (e.g., templating the datasets, passing parameters to nodes, etc.).

However, there are definitely some genuine use cases that are not fully covered by the YAML API, so it makes sense to apply some logic in code if it becomes apparent that there is no other way to get what you want. Even in that situation, though, I would personally refrain from porting all dataset definitions into code, since 95% of them are just regular configuration. If that makes sense.

lorenabalan commented

Closing this as answered, but feel free to re-open/open a new issue if you need further clarification.
