Change the default behavior of run to run_only_missing #55

Closed
Minyus opened this issue Jul 12, 2019 · 1 comment
Labels
Issue: Feature Request New feature or improvement to existing feature

Comments

@Minyus
Contributor

Minyus commented Jul 12, 2019

Description

It would be more intuitive for run to avoid re-computing nodes by default, leaving the current behavior available as a force_rerun option.

Context

A major motivation for saving intermediate files, even though it consumes disk space and requires additional computation time, is to avoid re-computing nodes. It is therefore more intuitive to use the saved files by default, as run_only_missing already does in the current version of kedro.

This suggestion is related to #30 and #25.

Possible Implementation

Modify run
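
To make the proposal concrete, here is a minimal sketch of the behavior being asked for. It does not use kedro at all: `Node`, `run`, and the plain-dict `store` are hypothetical stand-ins for kedro's nodes, runner, and data catalog, and the skip rule is deliberately simplified to "skip a node whose output already exists".

```python
from typing import Callable, Dict, List, NamedTuple


class Node(NamedTuple):
    """Hypothetical stand-in for a pipeline node: one input, one output."""
    func: Callable
    input: str
    output: str


def run(nodes: List[Node], store: Dict[str, object],
        force_rerun: bool = False) -> List[str]:
    """Run nodes in order, skipping any node whose output is already stored
    unless force_rerun is set. Returns the names of the outputs computed."""
    computed = []
    for node in nodes:
        if not force_rerun and node.output in store:
            continue  # reuse the saved intermediate result
        store[node.output] = node.func(store[node.input])
        computed.append(node.output)
    return computed


nodes = [
    Node(lambda xs: [x * 2 for x in xs], "raw", "doubled"),
    Node(sum, "doubled", "total"),
]
store = {"raw": [1, 2, 3]}

first = run(nodes, store)                    # both outputs are missing
second = run(nodes, store)                   # nothing is missing, nothing runs
third = run(nodes, store, force_rerun=True)  # recompute everything
```

With this default, the second call computes nothing because both outputs already exist, and only `force_rerun=True` restores the recompute-everything behavior.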

@Minyus Minyus added the Issue: Feature Request New feature or improvement to existing feature label Jul 12, 2019
@idanov
Member

idanov commented Jul 16, 2019

Hi @Minyus, thank you for opening an issue about this. As you pointed out, run_only_missing could be very useful during development. In production, however, the operating mode would be to recompute the data on every run, since the raw data is expected to change. Otherwise the pipeline would eventually end up doing nothing and the results would be stale.

When an MLOps or DevOps person tries to run the pipeline in production, they would expect the pipeline to be run with the default option of a command like kedro run. It would be quite easy for them to miss the suggested --force-rerun flag and they might end up deploying a pipeline which will work only once and never again, since none of the data will be missing after a run.
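
The failure mode can be illustrated with a toy example. This is only a sketch and does not use kedro: the `run_only_missing` function below is a hypothetical one-node pipeline over a plain dict, written to show how skipping existing outputs ignores changes in the raw data.

```python
def run_only_missing(store):
    """Hypothetical toy pipeline: compute 'total' from 'raw' only if absent."""
    if "total" not in store:
        store["total"] = sum(store["raw"])
    return store["total"]


store = {"raw": [1, 2, 3]}
day_one = run_only_missing(store)  # 'total' is missing, so it is computed

store["raw"] = [10, 20, 30]        # new raw data arrives in production
day_two = run_only_missing(store)  # 'total' exists, so nothing is recomputed

fresh = sum(store["raw"])          # what a full rerun would have produced
```

Here `day_two` still holds the value computed from the old raw data, while `fresh` reflects the new data.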

Therefore, the decision to run only missing nodes is relevant only during development and can be made with an optional flag by the developer of the pipeline, since they are much more knowledgeable about kedro than an MLOps or DevOps person deploying the code in production.

Even if #30 is added as an option, we would very likely keep it as an optional flag rather than default behaviour. I will close the issue for now.
