Skip to content

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal for Partial/Custom node ordering for SequentialRunner #3717

Closed
noklam opened this issue Mar 15, 2024 · 7 comments
Closed

Proposal for Partial/Custom node ordering for SequentialRunner #3717

noklam opened this issue Mar 15, 2024 · 7 comments
Labels
Issue: Feature Request New feature or improvement to existing feature

Comments

@noklam
Copy link
Contributor

noklam commented Mar 15, 2024

Description

Support custom / partial node order where user desired.

Background

Kedro offers 3 Runners out of the box (SequentialRunner, ParallelRunner and ThreadRunner), there is another SoftFailRunner which I have implemented and can be installed in https://pypi.org/project/kedro-softfail-runner/.

In the past we have focus to make consistent support for runners and are reluctant to introduce feature parity. In fact, runners has been mostly unchanged for years. Improve resume pipeline suggestion for SequentialRunner introduce a concept similar to "Change Data Capture" (CDC), which fixed the broken suggestion and started to consider about persisted data and the closest checkpoint to recover a failed pipeline. This is the most obvious feature parity among runners as it only support SequentialRunner due to the non-deterministic nature of parallel computing.

Context

TBD

  • Rethink how Kedro can play a role in multiprocessing / performance boost #3713 - I started to think more about runner recently, and my feeling is that SequentialRunner is the most important one, Kedro doesn't play an important role in terms of helping user to get code executed in a parallel fashion. This problem is usually solved by the 3rd party library, for example, polars support multi-core computing out of the box.

This proposal will only support SequentialRunner, which I am increasingly more comfortable with, details are discussed in #3713.

Design

Toposort will give ONE feasible solution, while there can be multiple possible solutions. In some case:

  • A -> B -> C -> D
  • A -> B -> D -> C
    In terms of computation, both pipelines will have identical result if C & D doesn't depend on each other. However, for business logic or just ease of understanding, some ordering may be preferred. (Think about large pipeline like https://demo.kedro.org/?pipeline_id=__default__, how we perceive the execution order is usually from top-to-bottom, this is not necessary how nodes are executed)

Requirements:

  • Non-breaking
  • Validation of ordering, it cannot violate toposort result.

Nice to have:

  • Don't need to introduce a tons of new API
  • Don't need to change how Kedro resolve execution order fundamentally (keep toposort)

Feature:

  • Provide an argument to support custom ordering.

Next step is providing an user friendly API, as this is likely still too low-level for the end user.

High level Proposal

Add new constrains to "node_dependencies" addition to the existing inputs/outputs pair.

Possible Implementation

Can't think of anything, thus dummy outputs has been the workaround for years.

Possible Alternatives

Current workaround involves dummy inputs outputs which become tedious quickly.

@noklam noklam added the Issue: Feature Request New feature or improvement to existing feature label Mar 15, 2024
@noklam noklam changed the title [Draft] - Proposal for Partial node ordering with SequentialRunner [Draft] - Proposal for Partial/Custom node ordering for SequentialRunner Mar 15, 2024
@datajoely
Copy link
Contributor

How would you imagine this work?

@noklam
Copy link
Contributor Author

noklam commented Mar 15, 2024

feature branch: @datajoely
main...noklam/node-ordering-proposal

Don't focus on the API, it's hacky and there is many random Pipeline call in the current process so I have to patch that everywhere.

Demo Project:
kedro-partial-node-order

comment or uncomment this line to play with it.

https://github.com/noklam/kedro-partial-node-order/blob/75b2febb45ab3a44f93928d9d0796bb9d9765ef7/src/ls/pipelines/data_processing/pipeline.py#L40

There are many way to build a nicer API, for example:

  • we can assume the order is following the list.
  • or we can use node_name
  • or we can introduce syntax like airflow etc, i.e. node_a >> node_e

It's not too important to decide this now, it's more of a PoC to prove this is possible. Benefit of this is Kedro viz won't see this at all (I think)

We can also assume it always follow the order of declaration, or try to stick with it as much as possible. Those are just design decision so I will delay that discussion.

@datajoely
Copy link
Contributor

Okay so I understand what you're trying to do. But have some questions.

  1. Why is asking users to pass dummy datasets between nodes insufficient?
  2. Is a deterministic sort (something like a seed) order a more useful enhancement?
  3. This feels like something that should happen at a orchestor / namespace granularity level first, so would address some of the recommendations in Synthesis of research related to deployment of Kedro to modern MLOps platforms #3094 first.

@noklam
Copy link
Contributor Author

noklam commented Mar 15, 2024

  1. Why is asking users to pass dummy datasets between nodes insufficient?
    Few reasons:
  • It contaminate the DAG flowchart (kedro-viz)
  • It's hard to edit or read custom execution order, users will need to go through nodes and try to do that matching in their head or edit multiple file.
  • It feels very hacky.
  • It's annoying that you cannot make Kedro run sequentially (or at least follow the declaration order when it's possible)
  1. Is a deterministic sort (something like a seed) order a more useful enhancement?

If you run with SequenetialRunner now, it's deterministic already #1604. Not sure if this is what you are talking about.

  1. This feels like something that should happen at a orchestor / namespace granularity level first, so would address some of the recommendations in Synthesis of research related to deployment of Kedro to modern MLOps platforms #3094 first.

Which recommendation are you referring to? there are many mentioned.

@noklam noklam changed the title [Draft] - Proposal for Partial/Custom node ordering for SequentialRunner Proposal for Partial/Custom node ordering for SequentialRunner Mar 15, 2024
@datajoely
Copy link
Contributor

  1. You've sold me
  2. Wasn't aware this was fixed, so yeah your prioritization is right.
  3. I think the user provided session ID is the only one I'd put ahead of this.

@noklam
Copy link
Contributor Author

noklam commented Mar 18, 2024

@datajoely 2. is one the thing that annoys me before I joined thus the first thing I fixed :P (also requested by former colleague), CacheDataset is the other one that I still haven't fixed.

@datajoely
Copy link
Contributor

We'll get there 🚀

@kedro-org kedro-org locked and limited conversation to collaborators Mar 28, 2024
@noklam noklam converted this issue into discussion #3758 Mar 28, 2024

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →

Labels
Issue: Feature Request New feature or improvement to existing feature
Projects
None yet
Development

When branches are created from issues, their pull requests are automatically linked.

2 participants