
Mechanism for programmatically updating datasets #188

Closed
barkerje opened this issue Dec 16, 2019 · 4 comments

Comments

@barkerje

Description

I would like to have an iterative pipeline where a dataset, train_data, is read in at the beginning of the pipeline and a new, updated version of the same train_data dataset is written at the end. I was trying to do this with a versioned dataset so that I can track my progress after each iteration. However, when I try to implement this I get:

```
kedro.pipeline.pipeline.CircularDependencyError: Circular dependencies exist among these items:
```

Context

I am implementing an active learning pipeline with the following (simplified) workflow:

  1. Read in data from train_data and use it to train a model.
  2. Infer on infer_data and get annotator to provide new infer_labels.
  3. Update train_data to include infer_data and infer_labels.
  4. Repeat from step 1.

At present, I cannot figure out any way to do this in the Kedro framework due to the restrictions placed on pipelines. How would you suggest implementing a pipeline like this?
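For concreteness, here is a minimal sketch of the loop I have in mind (function and dataset names are just illustrative); building this pipeline raises the error above:

```python
from kedro.pipeline import Pipeline, node

def train_model(train_data): ...
def infer(model, infer_data): ...
def update_train_data(infer_data, infer_labels): ...

pipeline = Pipeline(
    [
        node(train_model, "train_data", "model"),
        node(infer, ["model", "infer_data"], "infer_labels"),
        # conceptually this node should also take the current train_data
        # as an input, but even without it the graph is cyclic:
        # train_data -> model -> infer_labels -> train_data
        node(update_train_data, ["infer_data", "infer_labels"], "train_data"),
    ]
)
```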

@barkerje barkerje added the Issue: Feature Request New feature or improvement to existing feature label Dec 16, 2019
@DmitriiDeriabinQB
Contributor

@barkerje thank you for your message. You could have a look at using a PartitionedDataSet to incrementally update your train data.
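For example, a rough sketch (function and partition names are illustrative): each pipeline run writes the newly labelled rows as a fresh partition, and the next run concatenates all partitions when loading.

```python
import pandas as pd

def load_training_set(partitioned_train_data):
    # a PartitionedDataSet loads as a dict of {partition_id: load_function}
    return pd.concat(
        [load() for load in partitioned_train_data.values()], ignore_index=True
    )

def append_labelled_partition(infer_data, infer_labels, iteration: str):
    # returning {partition_name: data} saves a new file alongside the
    # existing partitions instead of overwriting the whole dataset
    return {f"iteration_{iteration}": infer_data.assign(label=infer_labels)}
```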

CircularDependencyError indicates that you are trying to load from and save to the same dataset (train_data in your example) within one pipeline, which is not allowed in Kedro. If you need to update your training set as part of the same workflow, you can create a new dataset entry with a different name that points to the same physical location as the original training data. You can then load from one dataset at the beginning of your pipeline and save to the other later on. Please note that this workaround is only needed because you load the data before saving it; a dataset that is produced by one node and then consumed by a downstream node is an ordinary dependency, not a circular one.
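A sketch of that workaround (paths are illustrative, and the exact dataset class and import path depend on your Kedro version):

```python
from kedro.io import DataCatalog
from kedro.extras.datasets.pandas import CSVDataSet  # location varies by version

catalog = DataCatalog(
    {
        # two catalog entries, same file on disk
        "train_data": CSVDataSet(filepath="data/01_raw/train_data.csv"),
        "train_data_updated": CSVDataSet(filepath="data/01_raw/train_data.csv"),
    }
)
```

The equivalent in conf/base/catalog.yml is simply two entries with the same filepath; your pipeline then loads from train_data and saves to train_data_updated.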

Also, may I suggest posting this kind of "how to" question on Stack Overflow (tagged with kedro)? Other users may benefit from the answers and, arguably, SO is more suitable for Q&A.

@DmitriiDeriabinQB DmitriiDeriabinQB added Type: Discussion and removed Issue: Feature Request New feature or improvement to existing feature labels Jan 20, 2020
@lorenabalan
Contributor

Hi @barkerje, if you're happy with the reply above, please consider closing this issue. :)

@yetudada
Contributor

Hi @barkerje! We hope that you got sufficient help from @DmitriiDeriabinQB on this issue. I'm going to close it now. Let us know if you have any more thoughts by commenting on this issue or creating a new one.

@seeM commented Jun 27, 2020

Related to the proposal in #341
