
Batch spawn POC #5438

Draft: wants to merge 4 commits into master

Conversation

@hjoliver (Member) commented Mar 29, 2023

Zeroth-level workaround for #5437

(We still need to optimize the problematic code, but this approach might be useful in the interim, and possibly in the long run as well).

UPDATE: the main problem (n-window computation) was fixed by #5475. But it may still be worth doing this as well.

When an output gets completed, instead of spawning all children into the task pool at once, record what needs to be spawned, and spawn them batch-wise via the main loop.

This plays well with queuing, because queues work with what they've got, so tasks can be released to run throughout the long spawning period.
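
For anyone skimming, here is a minimal sketch of the idea in plain Python. All names here (BatchSpawner, MAX_SPAWN_PER_CYCLE, record_children) are illustrative, not the actual code on this branch:

from collections import deque

MAX_SPAWN_PER_CYCLE = 100  # illustrative batch size, not a real Cylc setting

class BatchSpawner:
    """Record children when outputs complete; spawn them in bounded batches."""

    def __init__(self, pool):
        self.pool = pool                 # the live task pool (a plain list here)
        self.tasks_to_spawn = deque()    # (parent, output, child) awaiting spawn

    def record_children(self, parent, output, children):
        # On output completion: remember the children instead of spawning now.
        for child in children:
            self.tasks_to_spawn.append((parent, output, child))

    def spawn_children(self):
        # Called once per main-loop cycle: spawn at most one batch.
        for _ in range(min(MAX_SPAWN_PER_CYCLE, len(self.tasks_to_spawn))):
            parent, output, child = self.tasks_to_spawn.popleft()
            self.pool.append(child)      # stand-in for the real spawn logic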

The example from #5437 is quite usable on this branch, although CPU remains high till spawning is complete, and only the GUI table view is workable (and that with filtering for active tasks):

[task parameters]
   m = 0..6999
[scheduling]
   [[queues]]
      [[[default]]]
         limit = 4
   [[graph]]
      R1 = "a => b<m>"
[runtime]
   [[a]]
      script = sleep 10
   [[b<m>]]

Check List

  • I have read CONTRIBUTING.md and added my name as a Code Contributor.
  • Contains logically grouped changes (else tidy your branch by rebase).
  • Does not contain off-topic changes (use other PRs for other changes).
  • Applied any dependency changes to both setup.cfg (and conda-environment.yml if present).
  • Tests are included (or explain why tests are not needed).
  • CHANGES.md entry included if this is a change that can affect users
  • Cylc-Doc pull request opened if required at cylc/cylc-doc/pull/XXXX.
  • If this is a bug fix, PR should be raised against the relevant ?.?.x branch.

@hjoliver (Member, Author)

(Tests not happy but I'm pretty sure it'll be something trivial...)

@oliver-sanders (Member) left a comment

This seems like the best option to me.

Tried it out, the workflow is able to make progress as expected.

I found that client connections were taking longer than expected. Not sure why; a bit of debugging showed that once the requests were received they were processed quickly by the workflow, so it's the connection/security bit that is slow? Possibly the authenticator thread not getting enough CPU?

For the most part connections worked fine, but requests with larger payloads (e.g. cylc dump and cylc tui) don't work without a major increase in --comms-timeout. Assuming we can reduce the number of increment_graph_window calls, this problem should go away, and we should be able to up the maximum number of tasks spawned per main-loop cycle.

@@ -1645,6 +1645,7 @@ async def main_loop(self) -> None:
# Shutdown workflow if timeouts have occurred
self.timeout_check()

self.pool.spawn_children()

@oliver-sanders (Member) commented Mar 29, 2023:
I think a lot of the functional tests can be sensitive to the order of main loop events. Maybe try to relocate this to where spawning happened before, near process_queued_task_messages or whatever.
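
For illustration, the relocation might look something like this (ordering sketch only; the neighbouring calls in the real main loop may differ):

async def main_loop(self) -> None:
    ...
    self.process_queued_task_messages()
    # Spawn the next batch of recorded children here, where spawning used to
    # happen, to keep order-sensitive functional tests happy.
    self.pool.spawn_children()
    ...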

children = itask.graph_children[output]
if forced:
self.tasks_to_spawn_forced[
(itask, output)

Member

(itask, output, forced) to avoid needing two dicts?

Possibly consider using a deque or list rather than a dict, unless there's a need to perform "in" (membership) checks.
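
For example, something along these lines (a sketch only; the names are hypothetical):

from collections import deque

# One FIFO of (itask, output, forced) tuples instead of two dicts;
# a deque is fine so long as we never need "in" membership checks on it.
tasks_to_spawn: deque = deque()

def record(itask, output, forced=False):
    tasks_to_spawn.append((itask, output, forced))

def next_batch(n):
    # Yield up to n recorded entries, oldest first, for spawning.
    for _ in range(min(n, len(tasks_to_spawn))):
        yield tasks_to_spawn.popleft()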

@oliver-sanders (Member)

FYI: I've bumped into a workflow today which would benefit from this change even after the increment_graph_window efficiency improvements.

@hjoliver (Member, Author) commented May 8, 2023

OK, I'll resurrect this and try to finish it off soon.

@hjoliver (Member, Author) commented May 11, 2023

Damn it, I've just realized a problem with this approach. 😡

Sometimes we need to update the prerequisites of already-spawned tasks. On this branch, it's possible for those tasks to be queued for spawning by the main loop and hence not actually spawned yet. So for this to work, we'll need a more complex data structure that records tasks-to-be-spawned and the prerequisites to update immediately when they are spawned.

[UPDATE] Actually it's more subtle than that. I was already recording children-to-be-spawned against the spawning outputs, and updating that prerequisite at spawn time; the problem is there can be multiple such outputs, so the mapping really needs to be the other way round (child -> list-of-parent-outputs, where the list can grow over time before spawning).
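
A rough sketch of that reversed mapping (all names here are illustrative, not the actual implementation):

from collections import defaultdict

class PendingSpawn:
    """Reverse mapping: child -> parent outputs recorded before it is spawned."""

    def __init__(self):
        # child task id -> list of (parent itask, output) pairs; the list can
        # keep growing until the child is actually spawned.
        self.by_child = defaultdict(list)

    def record(self, child_id, itask, output):
        self.by_child[child_id].append((itask, output))

    def pop(self, child_id):
        # At spawn time, return all parent outputs to satisfy on the new child.
        return self.by_child.pop(child_id, [])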

[UPDATE 2] Got it working, but major tidying needed before I push it up...
