-
Notifications
You must be signed in to change notification settings - Fork 903
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor Runners, introduce Task
class
#4206
Conversation
Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>
# Conflicts: # kedro/runner/runner.py
Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>
Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>
Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>
kedro/runner/sequential_runner.py
Outdated
@@ -75,21 +75,22 @@ def _run( | |||
|
|||
for exec_index, node in enumerate(nodes): | |||
try: | |||
run_node(node, catalog, hook_manager, self._is_async, session_id) | |||
from kedro.runner.task import Task |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is needed because I refactored run_node
in the runner to use Task
and moved methods to Task
as well. run_node
isn't actually needed anymore, but removing it would be a breaking change. I could undo the changes to runner.py
which removes the import from Task
and then allows it to be imported inside the runner implementations again. The downside is that we'd have duplicated code in runner.py
and task.py
.
Signed-off-by: Merel Theisen <49397448+merelcht@users.noreply.github.com>
Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>
Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>
Is there a relevant issue for this? Nothing against the idea of introducing the "task" abstraction; just interested to better understand what motivates it. |
I'll update the description when the PR is ready for review. |
Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>
Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>
Signed-off-by: Merel Theisen <49397448+merelcht@users.noreply.github.com>
Task
Task
class
In general not much concern the refactoring looks simple enough and is a better abstraction than the previous one. I would like to run the benchmark once #4210 is ready to make sure we don't run into memory/perf issue. Side note: I think the current |
…ional argument, and adding parallel as boolean flag Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>
Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>
Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks much cleaner now, thank you @merelcht!
Left one minor suggestion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The PR looks good, I still want to wait #4210 to benchmark the runner if it's not too urgent.
Going through the runner code again, I have some thought (not specific to the refactoring).
Q1: With modern Python like asyncio
, the first question I have is are we really doing async
in Kedro? I see most of the async
references are associated with threading, but I don't think they are the same thing. This maybe something that we could consider and I think in general async
is simpler and was designed exactly for I/O taskes. Things may be a bit tricky since Python 3.13 start having a GIL free thread.
Q2: is_async
in SequentialRunner is only for a limited scope, i.e. if nodes has mulitiple dataset, the order of loading doesn't matter and can be loaded in an async manner (Why isn't it the default already?)
Q3: There is another level of async, which is at multi-node level where each node is being executed asynchronously, again async
maybe a simpler solution.
Yes of course! We can definitely wait for that.
These are excellent points, thanks @noklam ! I haven't had time to continue refactoring, but I will definitely take this on as part of it. |
@noklam I don't know if I did this correct by I ran
Then I did
So overall it looks like there isn't really a difference in performance. From the pipeline test: https://github.com/kedro-org/kedro/actions/runs/11594561241/job/32281088073?pr=4206 you can also see there's hardly a difference. |
Signed-off-by: Nok Lam Chan <nok.lam.chan@quantumblack.com>
Signed-off-by: Nok Lam Chan <nok.lam.chan@quantumblack.com>
Description
Introduced the
Task
class, which encapsulates what is actually run in each of the runners to make theRunners
code more readable.I've always found the code in runners a bit hard to navigate. The (simplified) flow before my refactor for running a node was:
Now it's:
Development notes
Task
which contains all that is needed to execute aNode
._release_datasets()
to remove duplicated code across the runners.runner.py
to theTask
class.parallel_runner.py
totask.py
run_node()
as deprecated, because it's replaced byTask.execute()
and no longer called directly anywhere other than tests.Developer Certificate of Origin
We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a
Signed-off-by
line in the commit message. See our wiki for guidance.If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.
Checklist
RELEASE.md
file