Skip node at runtime #2410

Open
sbrugman opened this issue Mar 10, 2023 · 5 comments
Labels
Issue: Feature Request New feature or improvement to existing feature

Comments

@sbrugman
Contributor

sbrugman commented Mar 10, 2023

There currently is no way (that I know of) to skip a node at runtime (e.g. from a hook), without failing the pipeline run.

Is there already an idiomatic way of doing so, e.g. building a custom runner with a function similar to run_only_missing?

Alternatives considered:

  • Overwriting the function with a no-op will still continue with saving the dataset, which should be avoided in that case.
  • Removing the node might have unintended side effects.
  • Alternatively, the pipeline could be built dynamically. A downside is that the hooks abstraction cannot then be used (to determine whether to skip a node), so this likely comes with a lot of boilerplate/overhead.

If not, is this something that would be welcome as a contribution? It could be a fairly simple and generic addition. (Happy to add it.)

(related to #2307)

@sbrugman sbrugman added the Issue: Feature Request New feature or improvement to existing feature label Mar 10, 2023
@datajoely
Contributor

How would you like this to work if it existed? Is it based on a condition or is it known pre-run?

@sbrugman
Contributor Author

sbrugman commented Mar 10, 2023

What would work well in the case above is if a node could be skipped via a simple boolean flag that can be set in before_node_run. (Might need some extra thinking.)

Indeed, it would be based on a condition only known at runtime. In the referenced issue this would be a cache hit; however, I can imagine use cases with other conditions.

(Note that GitHub Actions, Azure DevOps pipelines and related tools do support this and could be a source of inspiration.)

@merelcht merelcht added the Community Issue/PR opened by the open-source community label Mar 13, 2023
@antonymilne
Contributor

antonymilne commented Mar 14, 2023

I think there are going to be people who disagree (e.g. @idanov), but personally I like this idea and think doing it with hooks feels very natural. before_node_run already enables some "advanced" behaviour where you can return a dictionary to dynamically override node inputs. We could also introduce a sentinel value SKIP, and if the hook returns that value, skip execution of the node.
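
A hypothetical sketch of what that could look like. The SKIP sentinel and the "skip when the hook returns it" behaviour are the proposal and do not exist in Kedro today; only the before_node_run signature and the dict-return input override are existing behaviour, and "expensive_node" is a placeholder condition:

```python
# Hypothetical sketch only: the SKIP sentinel and the skip-on-return behaviour
# are the proposal in this issue, not existing Kedro API.
from kedro.framework.hooks import hook_impl

SKIP = object()  # proposed sentinel value


class SkipNodeHooks:
    @hook_impl
    def before_node_run(self, node, catalog, inputs):
        # Today, returning a dict here overrides node inputs. The proposal is
        # that returning SKIP would make the runner skip this node entirely.
        if node.name == "expensive_node":  # placeholder runtime condition
            return SKIP
        return None
```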

Here are three other ideas that are already possible, though I suspect they won't offer the full, flexible, dynamic functionality you'd like. They could also be used in combination:

  1. Like you suggest, take the code from run_only_missing and use it to define your own custom runner. Put this in <project-name>/src/<python_package>/runner.py and then do kedro run --runner=<python_package>.runner.MissingOnlySequentialRunner
  2. Use Pipeline.filter in your pipeline_registry.py to register a pipeline skip_nodes and then run with kedro run -p skip_nodes (see the sketch after this list).
  3. The no-op idea: the key to getting this working, I think, would be to override node.run and not node.func as you might expect. Take a look at https://gist.github.com/mzjp2/076bfd73b0215bda01ee71186966389d and the discussion it came from: DVC Plugin to skip Nodes if Data and Code are up to date #837.
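
A minimal sketch of option 2, assuming a project package named my_project with a data_processing pipeline and a hypothetical node called expensive_node. Since Pipeline.filter selects nodes to keep, the sketch instead removes the node using pipeline subtraction together with Pipeline.only_nodes:

```python
# src/my_project/pipeline_registry.py -- minimal sketch; "my_project",
# "data_processing" and "expensive_node" are placeholder names.
from kedro.pipeline import Pipeline

from my_project.pipelines import data_processing


def register_pipelines() -> dict[str, Pipeline]:
    full = data_processing.create_pipeline()
    return {
        "__default__": full,
        # Subtracting a sub-pipeline removes those nodes; everything else,
        # including their downstream consumers, stays in the pipeline.
        "skip_nodes": full - full.only_nodes("expensive_node"),
    }
```

Running kedro run -p skip_nodes then executes everything except the removed node; note this is a static, pre-run choice rather than a runtime decision.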

@Sm1Ling

Sm1Ling commented Nov 1, 2023

Hi! Has anyone taken up this feature?
Looking forward to it.

Options:

  • Cache the configs (or hashes of the configs) of upstream nodes and compare them on each launch.
  • Check whether the output dataset of a node already exists (see the sketch after this list). For instance, I name datasets task-wise: different tasks have different dataset names, and the same task uses the same pipeline settings.
  • Cache the names and sizes of upstream nodes' source files (or other proxies for changes), and compare them with the data from the current launch.
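
For the second option, a minimal sketch of detecting the condition with existing Kedro API (DataCatalog.exists and Node.outputs). Note that the hook can only report the finding today; actually skipping the node still requires one of the workarounds discussed above:

```python
# Minimal sketch: flag nodes whose outputs all exist already. Detecting the
# condition uses existing Kedro API; acting on it (skipping the node) still
# needs one of the workarounds discussed earlier, e.g. a custom runner.
import logging

from kedro.framework.hooks import hook_impl

logger = logging.getLogger(__name__)


class OutputsExistHooks:
    @hook_impl
    def before_node_run(self, node, catalog):
        if node.outputs and all(catalog.exists(name) for name in node.outputs):
            logger.info(
                "All outputs of %s already exist; it is a candidate to skip.",
                node.name,
            )
```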

@sbrugman
Contributor Author

Our team is working on a Kedro runner for this. PyCodeHash was just released and does the heavy lifting of hashing functions and datasets.

@merelcht merelcht removed the Community Issue/PR opened by the open-source community label Jul 8, 2024
@merelcht merelcht added this to the Something about Runners milestone Sep 12, 2024