Batch spawn POC #5438
base: master
Conversation
(Tests not happy but I'm pretty sure it'll be something trivial...)
This seems like the best option to me.
Tried it out, the workflow is able to make progress as expected.
I found that client connections were taking longer than expected. Not sure why; a bit of debugging showed that once the requests were received they were processed quickly by the workflow, so it's the connection/security bit that is slow? Possibly the authenticator thread isn't getting enough CPU?
For the most part connections worked fine, but requests with larger payloads, e.g. `cylc dump` and `cylc tui`, don't work without a major increase in `--comms-timeout`. Assuming we can reduce the number of `increment_graph_window` calls, this problem should go away and we should be able to up the max number of tasks spawned per main-loop cycle.
```diff
@@ -1645,6 +1645,7 @@ async def main_loop(self) -> None:
         # Shutdown workflow if timeouts have occurred
         self.timeout_check()
 
+        self.pool.spawn_children()
```
I think a lot of the functional tests can be sensitive to the order of main loop events. Maybe try relocating this to where spawning happened before, near `process_queued_task_messages` or whatever.
```python
children = itask.graph_children[output]
if forced:
    self.tasks_to_spawn_forced[
        (itask, output)
```
`(itask, output, forced)` to avoid needing two dicts?
Possibly consider using a deque or list rather than a dict, unless there's a need to perform `in` comparisons.
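For illustration only, here is a minimal sketch of the two alternatives being floated; `tasks_to_spawn`, `spawn_queue` and the tuple contents are made-up names, not the branch code:

```python
from collections import deque

# Option A: one dict keyed by (itask, output, forced) instead of two dicts.
# Dict keys give fast `in` membership checks if de-duplication is needed,
# and insertion order is preserved.
tasks_to_spawn: dict = {}
tasks_to_spawn[("task_a.1", "succeeded", False)] = None

# Option B: a deque (or list) of tuples, if insertion order is all that
# matters and no `in` comparisons are performed.
spawn_queue: deque = deque()
spawn_queue.append(("task_a.1", "succeeded", False))

# Draining a batch is cheap and order-preserving with either structure:
while spawn_queue:
    itask, output, forced = spawn_queue.popleft()
    print(itask, output, forced)
```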
FYI: I've bumped into a workflow today which would benefit from this change even after the
OK, I'll resurrect this and try to finish it off soon.
Damn it, I've just realized a problem with this approach. 😡 Sometimes we need to update the prerequisites of already-spawned tasks.

[UPDATE] Actually it's more subtle than that. I was already recording children-to-be-spawned against the spawning outputs, and updating that prerequisite at spawn time; the problem is there can be multiple such outputs, so the mapping really needs to be the other way round (child -> list-of-parent-outputs, where the list can grow over time before spawning).

[UPDATE 2] Got it working, but major tidying is needed before I push it up...
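A rough sketch of that reversed mapping, with made-up names throughout (`spawn_targets`, `record_child`, `spawn_batch`); it only illustrates the child -> list-of-parent-outputs idea, not the actual implementation:

```python
from collections import defaultdict

# child task id -> list of (parent task id, output) pairs recorded so far;
# the list can keep growing until the child is actually spawned.
spawn_targets: dict = defaultdict(list)


def record_child(child: str, parent: str, output: str) -> None:
    spawn_targets[child].append((parent, output))


def spawn_batch(limit: int) -> None:
    """Spawn up to `limit` recorded children, with all prerequisites seen so far."""
    for child in list(spawn_targets)[:limit]:
        parent_outputs = spawn_targets.pop(child)
        print(f"spawn {child}; prerequisites satisfied by {parent_outputs}")


# Two different parent outputs both lead to the same child, so the child's
# entry accumulates both before it is spawned:
record_child("c.1", "a.1", "succeeded")
record_child("c.1", "b.1", "succeeded")
spawn_batch(limit=10)
```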
Zero-th level workaround for #5437
(We still need to optimize the problematic code, but this approach might be useful in the interim, and possibly in the long run as well).
UPDATE: the main problem (n-window computation) was fixed by #5475. But it may still be worth doing this as well.
When an output gets completed, instead of spawning all children into the task pool at once, record what needs to be spawned, and spawn them batch-wise via the main loop.
This plays well with queuing, because queues work with what they've got, so tasks can be released to run throughout the long spawning period.
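As a toy illustration of the record-then-batch idea (all names and the per-cycle cap are assumptions, not the branch code):

```python
from collections import deque

MAX_SPAWN_PER_CYCLE = 1000  # illustrative cap, not a real cylc setting
pending_children: deque = deque()


def on_output_completed(children: list) -> None:
    # Cheap: just record what needs spawning, don't spawn yet.
    pending_children.extend(children)


def spawn_children() -> None:
    """Called once per main-loop cycle: spawn at most a bounded batch."""
    for _ in range(min(MAX_SPAWN_PER_CYCLE, len(pending_children))):
        child = pending_children.popleft()
        # Real code would add `child` to the task pool here, so queues can
        # release tasks to run while the backlog drains over several cycles.


on_output_completed([f"member{i}.1" for i in range(2500)])
while pending_children:  # each main-loop cycle drains up to 1000 children
    spawn_children()
```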
The example from #5437 is quite usable on this branch, although CPU remains high till spawning is complete, and only the GUI table view is workable (and that with filtering for active tasks).
Check List
- I have read `CONTRIBUTING.md` and added my name as a Code Contributor.
- Dependency changes applied to `setup.cfg` (and `conda-environment.yml` if present).
- `CHANGES.md` entry included if this is a change that can affect users.
- If this is a bug fix, PRs raised to both master and the relevant `?.?.x` branch.