Dynamically create catalog entries #138

Closed
roumail opened this issue Oct 23, 2019 · 4 comments
Labels
Issue: Feature Request New feature or improvement to existing feature

Comments

roumail commented Oct 23, 2019

Description

I have two Python dictionaries that I'm saving locally using kedro.io.PickleLocalDataSet.

These dictionaries are created using the snippet below:

import pickle
import pandas as pd

df = pd.DataFrame({'num_legs': [2, 4, 8, 0],
                   'num_wings': [2, 0, 0, 0],
                   'num_specimen_seen': [10, 2, 1, 8]},
                   index=['falcon', 'dog', 'spider', 'fish'])

# Two dictionaries of DataFrames; the number of keys differs between them.
d1 = {
    'name': df.sample(2),
    'phone': df.sample(2),
}
d2 = {'hello': df}

with open("d1", "wb") as h:
    pickle.dump(d1, h)

with open("d2", "wb") as h:
    pickle.dump(d2, h)

Therefore, my catalog.yml looks like this:

d1:
  type: kedro.io.PickleLocalDataSet
  filepath: ./d1

d2:
  type: kedro.io.PickleLocalDataSet
  filepath: ./d2

I wish there were an easy way to "explode" the two dictionaries, d1 and d2, into CSVLocalDataSet entries:

name:
  type: kedro.io.CSVLocalDataSet
  filepath: ./name.csv

phone:
  type: kedro.io.CSVLocalDataSet
  filepath: ./phone.csv

hello:
  type: kedro.io.CSVLocalDataSet
  filepath: ./hello.csv

I first tried to do this by creating a node that reads a dictionary and outputs a list of dataframes, using a small function like the following:

def dict2df_list(d):
    # Return the DataFrames stored as the dictionary's values.
    return list(d.values())

However, I get stuck when trying to specify the config.yml for the above node, because I don't know up front how many dataframes will be generated. As seen in the example, d1 has two dataframes as values while d2 has only one.
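
For illustration, this is roughly what the node definition would have to look like (expressed with the code API for brevity; a minimal sketch using the output names from this example). The outputs have to be enumerated up front:

from kedro.pipeline import node

# Only possible when the keys of d1 are known in advance:
split_d1 = node(dict2df_list, inputs="d1", outputs=["name", "phone"])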

It should be clear already, but I've been using the YAML-file-based way of declaring pipelines, nodes, config, etc. I understand that the code API is equivalent, but I'm not as familiar with that approach for declaring pipelines.

Context

There are many times when we have a node that creates a list of outputs. We can't always pre-specify how many outputs will be generated. Therefore, some documentation around such use cases would be really helpful.
I'm not exactly sure how to proceed here.

Possible Alternatives

When I encountered a similar situation in the past, I resolved the dilemma by implementing a separate node for each of the twenty cases I had. Without a looping construct there was, of course, a lot of code duplication, but at least it worked. I cannot use a similar approach in this case since I don't know the total number of cases up front.
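
For what it's worth, when the cases are known up front, a loop over the code API avoids that duplication. The sketch below is hypothetical (the extract helper and the cases mapping are made up for this example):

from kedro.pipeline import Pipeline, node

def extract(key):
    # Build a node function that pulls a single dataframe out of a pickled dict.
    def _extract(d):
        return d[key]
    return _extract

# Hypothetical mapping of source dataset -> keys; it must be known in advance.
cases = {"d1": ["name", "phone"], "d2": ["hello"]}

pipeline = Pipeline([
    node(extract(key), inputs=source, outputs=key, name="extract_" + key)
    for source, keys in cases.items()
    for key in keys
])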

roumail added the Issue: Feature Request label on Oct 23, 2019

DmitriiDeriabinQB commented Oct 23, 2019

@roumail Thank you for your feedback. Kedro does not support this feature out of the box at the moment. You can programmatically modify the DataCatalog and add new datasets to it after instantiating one from YAML; however, you would also need to dynamically change a) the node definition that produces those datasets, and b) all nodes that consume them, which leads to a somewhat clunky solution.
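
To make that concrete, a rough sketch of the programmatic route could look like the following (not an official recipe; it assumes the kedro.io classes and file paths from the catalog above):

from kedro.io import CSVLocalDataSet, DataCatalog, PickleLocalDataSet

# Catalog equivalent to the YAML entries above.
catalog = DataCatalog({
    "d1": PickleLocalDataSet(filepath="./d1"),
    "d2": PickleLocalDataSet(filepath="./d2"),
})

# "Explode" each pickled dictionary into one CSV dataset per key.
for source in ("d1", "d2"):
    for key in catalog.load(source):
        catalog.add(key, CSVLocalDataSet(filepath="./" + key + ".csv"))

The node and pipeline definitions that consume "name", "phone" and "hello" would still need to be generated dynamically, which is where it gets clunky.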

If this is critical for your use case, we would need more information from you about your particular need and why it is not feasible to achieve the same thing with a single dataset or a constant set of them.


roumail commented Oct 24, 2019

Hi @DmitriiDeriabinQB, thank you for your response. For my particular use case, I was able to fix it in the end by sticking to the YAML API; I had previously been trying to mix the two, which led to a clunky implementation and confusion.

I guess I would close with a request for more information around the code vs. YAML API. For example: 1) can we mix the YAML and code APIs, and 2) is doing so even a good idea from a design perspective? Based on your last comment, it seems we should stick with one of the APIs and not mix the two.

roumail closed this as completed on Oct 24, 2019
roumail reopened this on Oct 24, 2019
DmitriiDeriabinQB commented

This is a good question to ask :)

Generally, I would say the recommendation is to avoid mixing the code and YAML APIs. Most of the time it just boils down to not doing in code something that YAML already supports (e.g., templating the datasets, passing parameters to nodes, etc.).

However, there are definitely some genuine use cases that are not fully covered by the YAML API, so it makes sense to apply some logic in code if it becomes apparent that there is no other way to get what you want. Even in that situation, though, I would personally refrain from porting all dataset definitions into code, since 95% of them are just regular configuration. If that makes sense.

lorenabalan commented

Closing this as answered, but feel free to re-open/open a new issue if you need further clarification.
