Spawning + datastore with thousands of "ready" tasks #5437

Closed
hjoliver opened this issue Mar 29, 2023 · 1 comment
Labels
  • bug: Something is wrong :(
  • duplicate: This is a duplicate of something else

Comments

@hjoliver
Member

Not strictly a bug, but some extreme optimization is required in Cylc 8 for pathological workflows where a huge number of tasks hit the active window at once.

Here the scheduler has to spawn 7000 tasks all at once, off of a:succeed, which takes ~5 minutes on my fairly powerful laptop, during which time the scheduler is unresponsive.

[task parameters]
   m = 0..6999  # !!! 
[scheduling]
   [[queues]]
      [[[default]]]
         limit = 4
   [[graph]]
      R1 = "a => b<m>"  # !!!
[runtime]
   [[a]]
      script = sleep 10
   [[b<m>]]

Initial profiling results from @oliver-sanders show:

  1. primarily, the datastore n-window computation is responsible
  2. secondarily (much less time), each spawned task needs a database read to get its submit number and flow info
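
The secondary cost (one database read per spawned task) could in principle be amortised into a single batched query. A minimal sketch, assuming a simple sqlite table; the table and column names here are illustrative stand-ins, not Cylc's actual schema:

```python
import sqlite3

# Illustrative schema, NOT Cylc's actual workflow database layout.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE task_states (name TEXT, submit_num INTEGER)")
conn.executemany(
    "INSERT INTO task_states VALUES (?, ?)",
    [(f"b{i}", 1) for i in range(500)],
)

# One query for all spawned tasks, instead of one read per task.
wanted = [f"b{i}" for i in range(500)]
placeholders = ",".join("?" * len(wanted))
rows = conn.execute(
    f"SELECT name, submit_num FROM task_states WHERE name IN ({placeholders})",
    wanted,
).fetchall()
submit_nums = dict(rows)
print(len(submit_nums))  # 500 rows fetched in a single round trip
```

Note sqlite caps the number of bound parameters per statement, so a real implementation would chunk very large task lists.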

Ok, I got bored after 20 minutes or so and cut the run off at that point. FYI, if you ctrl+c your workflow, the profile.prof file still gets generated.

  • The spawn_on_output function itself took 0.1562s.
  • The increment_graph_window function in the data store took 1139s (including its resulting calls).

So it's the data store, not the task pool. The increment_graph_window function was called 4,325 times, but called itself recursively 30,276,650 times, which is where the CPU gets soaked up.
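
To see why recursion without visit tracking soaks up CPU like this, here is a toy sketch (not Cylc code; the graph and the `expand` function are purely illustrative): with no visited set, every path re-expands nodes it has already seen, so call counts grow multiplicatively with depth even on a tiny graph.

```python
def expand(graph, node, depth, counter):
    """Recursively visit neighbours out to `depth`, with NO visited set."""
    counter[0] += 1
    if depth == 0:
        return
    for neighbour in graph.get(node, []):
        expand(graph, neighbour, depth - 1, counter)

# A small dense graph: 8 nodes, every node links to every other node.
nodes = [f"t{i}" for i in range(8)]
graph = {n: [m for m in nodes if m != n] for n in nodes}

calls = [0]
expand(graph, "t0", 4, calls)
print(calls[0])  # 2801 calls to reach 8 nodes: grows as ~7^depth
```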

This is more-or-less as expected; we knew this function was being called more times than necessary, see this comment - #5319 (comment)

Two suggestions:

  1. If possible, batch the increment_graph_window / task spawning to reduce the number of top-level calls to increment_graph_window.
     • Would require heavy refactoring: the function is designed to expand the graph around one task at a time.
     • Potential savings ~4000x
  2. Come up with a more efficient approach to increment_graph_window.
     • I.e. remember which nodes we have already visited to avoid repeat visits.
     • Potential savings somewhere between 750x and 43,000x depending on the impact of batching.
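
The second suggestion (remembering visited nodes) can be sketched as a breadth-first expansion over the same toy graph; again this is illustrative only, not Cylc's actual data-store types:

```python
def expand_once(graph, start, depth):
    """Breadth-first window expansion that visits each node at most once."""
    visited = {start}
    frontier = [start]
    calls = 0
    for _ in range(depth):
        next_frontier = []
        for node in frontier:
            calls += 1  # one expansion per node, ever
            for neighbour in graph.get(node, []):
                if neighbour not in visited:
                    visited.add(neighbour)
                    next_frontier.append(neighbour)
        frontier = next_frontier
    return visited, calls

# Same dense 8-node graph as before: every node links to every other.
nodes = [f"t{i}" for i in range(8)]
graph = {n: [m for m in nodes if m != n] for n in nodes}

visited, calls = expand_once(graph, "t0", 4)
print(len(visited), calls)  # all 8 nodes found with only 8 expansions
```

The call count is now linear in the number of reachable nodes rather than exponential in depth, which is where the order-of-magnitude savings would come from.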

The end result of these increment_graph_window calls is 30,392,831 detokenise calls, but detokenise is not really the culprit here. There are 7000 tasks and 7000 dependencies, so only ~14,000 detokenise calls should be needed, meaning we are calling the interface ~2000 times more often than we should be. If we can make detokenise faster, great, but reducing the number of calls is where the order-of-magnitude improvements we need will come from.
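
If the residual detokenise cost ever matters, one cheap mitigation is memoising repeated calls. A sketch, assuming the function is pure in its inputs; `detokenise` here is a hypothetical stand-in, not the real cylc.flow.id implementation:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def detokenise(cycle, task, job=None):
    """Hypothetical ID-building function; the real signature may differ."""
    parts = [cycle, task] + ([job] if job else [])
    return "/".join(parts)

# Repeated calls with the same tokens hit the cache instead of rebuilding.
ids = [detokenise("1", f"b{i % 10}") for i in range(1000)]
print(len(set(ids)), detokenise.cache_info().hits)  # 10 unique ids, 990 hits
```

This only shaves the per-call cost, though; as noted above, the real win is making ~2000x fewer calls in the first place.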

@hjoliver hjoliver added the bug Something is wrong :( label Mar 29, 2023
@hjoliver hjoliver mentioned this issue Mar 29, 2023
8 tasks
@oliver-sanders oliver-sanders added this to the cylc-8.2.0 milestone Mar 29, 2023
@hjoliver hjoliver removed this from the cylc-8.2.0 milestone Apr 26, 2023
@hjoliver hjoliver added the duplicate This is a duplicate of something else label Apr 26, 2023
@hjoliver
Member Author

Closing as a duplicate of #5435 (although I think this was the original)
