
Mechanism for programmatically updating datasets #188

Closed
barkerje opened this issue Dec 16, 2019 · 4 comments

Comments

@barkerje

Description

I would like to have an iterative pipeline where a dataset, train_data, is read in at the beginning of the pipeline and a new, updated version of the same train_data dataset is written at the end. I was trying to do this with a versioned dataset so that I can track my progress after each iteration. However, when I try to implement this I get:

```
kedro.pipeline.pipeline.CircularDependencyError: Circular dependencies exist among these items:
```

Context

I am implementing an active learning pipeline with the following (simplified) workflow:

  1. Read in data from train_data and use it to train a model.
  2. Infer on infer_data and get annotator to provide new infer_labels.
  3. Update train_data to include infer_data and infer_labels.
  4. Repeat from step 1.

At present, I cannot figure out any way to do this in the Kedro framework due to the restrictions placed on pipelines. How would you suggest implementing a pipeline like this?
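For concreteness, here is a minimal sketch of the loop I have in mind (function and dataset names are just illustrative); building this pipeline raises the error above:

```python
from kedro.pipeline import Pipeline, node

def train_model(train_data): ...
def infer(model, infer_data): ...
def update_train_data(infer_data, infer_labels): ...

pipeline = Pipeline(
    [
        node(train_model, "train_data", "model"),
        node(infer, ["model", "infer_data"], "infer_labels"),
        # conceptually this node should also take the current train_data
        # as an input, but even without it the graph is cyclic:
        # train_data -> model -> infer_labels -> train_data
        node(update_train_data, ["infer_data", "infer_labels"], "train_data"),
    ]
)
```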

@barkerje barkerje added the Issue: Feature Request New feature or improvement to existing feature label Dec 16, 2019
@DmitriiDeriabinQB
Contributor

@barkerje thank you for your message. You could have a look at using a PartitionedDataSet to incrementally update your train data.
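For example, a rough sketch (function and partition names are illustrative): each pipeline run writes the newly labelled rows as a fresh partition, and the next run concatenates all partitions when loading.

```python
import pandas as pd

def load_training_set(partitioned_train_data):
    # a PartitionedDataSet loads as a dict of {partition_id: load_function}
    return pd.concat(
        [load() for load in partitioned_train_data.values()], ignore_index=True
    )

def append_labelled_partition(infer_data, infer_labels, iteration: str):
    # returning {partition_name: data} saves a new file alongside the
    # existing partitions instead of overwriting the whole dataset
    return {f"iteration_{iteration}": infer_data.assign(label=infer_labels)}
```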

CircularDependencyError indicates that you are trying to load from and save to the same dataset (train_data in your example) within one pipeline, which is not allowed in Kedro. If you need to update your training set as part of the same workflow, you can create a new dataset entry with a different name that points to the same physical location as the original training data. You can then load from one dataset at the beginning of your pipeline and save to the other later on. Please note that this workaround is only needed because you load the data before saving it; a dataset that is produced by one node and then consumed by a downstream node is an ordinary dependency, not a circular one.
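A sketch of that workaround (paths are illustrative, and the exact dataset class and import path depend on your Kedro version):

```python
from kedro.io import DataCatalog
from kedro.extras.datasets.pandas import CSVDataSet  # location varies by version

catalog = DataCatalog(
    {
        # two catalog entries, same file on disk
        "train_data": CSVDataSet(filepath="data/01_raw/train_data.csv"),
        "train_data_updated": CSVDataSet(filepath="data/01_raw/train_data.csv"),
    }
)
```

The equivalent in conf/base/catalog.yml is simply two entries with the same filepath; your pipeline then loads from train_data and saves to train_data_updated.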

Also, may I suggest posting this kind of "how to" question on Stack Overflow (tagged with kedro)? Other users may benefit from the answers and, arguably, SO is more suitable for Q&A.

@DmitriiDeriabinQB DmitriiDeriabinQB added Type: Discussion and removed Issue: Feature Request New feature or improvement to existing feature labels Jan 20, 2020
@lorenabalan
Contributor

Hi @barkerje, if you're happy with the reply above, please consider closing this issue. :)

@yetudada
Contributor

Hi @barkerje! We hope that you got sufficient help from @DmitriiDeriabinQB on this issue. I'm going to close it now. Let us know if you have any more thoughts by commenting on this issue or creating a new one.

@seeM commented Jun 27, 2020

Related to the proposal in #341
