Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

data store: remove threading #574

Draft
wants to merge 3 commits into
base: master
Choose a base branch
from
Draft

Conversation

oliver-sanders
Copy link
Member

@oliver-sanders oliver-sanders commented Mar 28, 2024

Closes #194

Run subscribers via asyncio rather than the ThreadPoolExecutor.

I think the only non-blocking operations in the code being called were:

  • time.sleep (has an async variant)
  • asyncio.sleep (async)
  • self.socket.recv_multipart (async)

If so, I don't think we need to use threading here so I've refactored the code so that the underlying async functions could be called via asyncio.

If so, this cuts out the need for the ThreadPoolExecutor removing the workflow limit.

Check List

  • I have read CONTRIBUTING.md and added my name as a Code Contributor.
  • Contains logically grouped changes (else tidy your branch by rebase).
  • Does not contain off-topic changes (use other PRs for other changes).
  • Applied any dependency changes to both setup.cfg (and conda-environment.yml if present).
  • Tests are included (or explain why tests are not needed).
  • CHANGES.md entry included if this is a change that can affect users
  • Cylc-Doc pull request opened if required at cylc/cylc-doc/pull/XXXX.
  • If this is a bug fix, PR should be raised against the relevant ?.?.x branch.

@hjoliver
Copy link
Member

hjoliver commented Apr 2, 2024

Nice, hope we can get this working 🚀

@oliver-sanders
Copy link
Member Author

oliver-sanders commented Apr 3, 2024

Ping @dwsutherland, I may well have overlooked something, does this make sense to you?

(note, tests are unhappy for async reasons, but it should work for real usage)

@dwsutherland
Copy link
Member

dwsutherland commented Apr 5, 2024

Ping @dwsutherland, I may well have overlooked something, does this make sense to you?

(note, tests are unhappy for async reasons, but it should work for real usage)

Will have a look, I think I wanted to do this when I was first building, but didn't manage to figure out how to have non-blocking subscription receive loops or something.. Can't remember exactly why (wasn't as simple as the sleeps..), but well done if you've figured it out 👏

@dwsutherland
Copy link
Member

dwsutherland commented Apr 9, 2024

The changes look quite simple .. had a play, and I did notice something I couldn't reproduce on master.
When stopping a workflow in the WUI (latest master cylc-ui):
image

[I 2024-04-09 21:37:20.380 CylcUIServer] $ cylc play --color=never --mode live five/run1
[I 2024-04-09 21:37:20.901 CylcHubApp log:191] 200 POST /user/sutherlander/cylc/graphql (sutherlander@::ffff:127.0.0.1) 530.46ms
[I 2024-04-09 21:37:20.906 CylcUIServer] [data-store] connect_workflow('~sutherlander/five/run1', <dict>)
[I 2024-04-09 21:37:30.551 CylcHubApp log:191] 200 POST /user/sutherlander/cylc/graphql (sutherlander@::ffff:127.0.0.1) 99.47ms
[I 2024-04-09 21:37:30.943 CylcUIServer] [data-store] disconnect_workflow('~sutherlander/holder/run1')
21:37:31.535 [ConfigProxy] error: 503 GET /user/sutherlander/cylc/subscriptions connect ECONNREFUSED 127.0.0.1:33077
[I 2024-04-09T21:37:31.541 JupyterHub log:191] 200 GET /hub/error/503?url=%2Fuser%2Fsutherlander%2Fcylc%2Fsubscriptions (@127.0.0.1) 5.35ms
21:37:31.894 [ConfigProxy] error: 503 GET /user/sutherlander/cylc/subscriptions connect ECONNREFUSED 127.0.0.1:33077
.
.
.
[I 2024-04-09T21:37:34.955 JupyterHub log:191] 200 GET /hub/error/503?url=%2Fuser%2Fsutherlander%2Fcylc%2Fsubscriptions (@127.0.0.1) 1.33ms
21:37:37.405 [ConfigProxy] error: 503 GET /user/sutherlander/cylc/subscriptions connect ECONNREFUSED 127.0.0.1:33077
[I 2024-04-09T21:37:37.407 JupyterHub log:191] 200 GET /hub/error/503?url=%2Fuser%2Fsutherlander%2Fcylc%2Fsubscriptions (@127.0.0.1) 1.35ms
[W 2024-04-09T21:37:39.146 JupyterHub base:1154] User sutherlander server stopped, with exit code: 0
[I 2024-04-09T21:37:39.146 JupyterHub proxy:356] Removing user sutherlander from proxy (/user/sutherlander/)

Not sure why.. could be something unrelated? maybe this branch needs rebased..

Looks like it might be trying to reconnect to the disconnected workflow?

@dwsutherland
Copy link
Member

dwsutherland commented Apr 9, 2024

Ah, think I might know why, maybe .. the subscriber resolver (cylc-flow) might use that w_subs variable, which you changed to subscribers but workflow_subscribers would have been better I think

Comment on lines +210 to 214
except WorkflowStopped:
self.disconnect_workflow(w_id)
except Exception as exc:
self.log.error(f'Failed to connect to {w_id}: {exc}')
self.disconnect_workflow(w_id)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved error handling up a level from the _start_subscription and _entire_workflow_update methods, no more threads to interfere with error reporting.

* Closes cylc#194
* Run subscribers via asyncio rather than the ThreadPoolExecutor.
* The only non-blocking operations in the code being called were:
  - time.sleep  (has an async variant)
  - asyncio.sleep (async)
  - self.socket.recv_multipart (async)
* Refactored the code so that the underlying async functions could be
  called via asyncio.
* Move error handling up a level into `connect_workflow` from the
  `_start_subscription` and `_entire_workflow_update` methods.
* Simplify tests (all exceptions are now caught in the same way).
* Remove the multi-workflow handling ability of
  `_entire_workflow_update`, this is unused and can now be achieved more
  easily via asyncio.gather as the threadding has been removed.
@oliver-sanders oliver-sanders marked this pull request as ready for review April 22, 2024 15:24
@oliver-sanders oliver-sanders self-assigned this Apr 22, 2024
@oliver-sanders oliver-sanders added this to the 1.5.0 milestone Apr 22, 2024
Copy link
Member

@dwsutherland dwsutherland left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still getting the same thing happening, when stopping a workflow (from UI of CLI):
image

When using hubless, the UIS just disconnects and process stops:

.
.
.
[I 2024-05-06 19:14:36.301 CylcUIServer] $ cylc play --color=never --mode live five/run1
[I 2024-05-06 19:14:36.834 CylcUIServer] [data-store] connect_workflow('~sutherlander/five/run1', <dict>)
[I 2024-05-06 19:14:49.404 CylcUIServer] [data-store] disconnect_workflow('~sutherlander/five/run1')
(uiserver) sutherlander@cortex-vbox:cylc-uiserver$ 

So it appears the UIS stops or crashes when a workflow is stopped/disconnected.

@oliver-sanders oliver-sanders marked this pull request as draft May 10, 2024 10:03
@oliver-sanders oliver-sanders modified the milestones: 1.5.0, 1.6.0 May 10, 2024
@dwsutherland
Copy link
Member

dwsutherland commented Jun 10, 2024

Does it make sense to deal with this one:
#584
here, at the same time?

@oliver-sanders
Copy link
Member Author

Can move away from Tornado async interfaces in due course (these are already translated into asyncio calls by the tornado library), Jupyter Server hasn't moved over yet so we may need to wait a while before changing ourselves.

@oliver-sanders oliver-sanders modified the milestones: 1.6.0, 1.7.0 Dec 3, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Revisit use of multiple subscription sockets/threads
4 participants