
WFS Data Store Sync #95

Merged (5 commits), Dec 3, 2019
Conversation

@dwsutherland (Member) commented Oct 18, 2019

These changes partially address cylc/cylc-flow#3207

This branch is the companion of cylc/cylc-flow#3389 and needs to be merged after it
(CI will fail until that PR is merged)

Functional/Done:

  • Updates to the data-store via subscriptions to workflow published deltas.
  • Verification of data-store, and reconciliation with the workflow data-store.

To Do (excluding the items below):

  • Investigate alternatives to threading the subscription/sync loops.

The sync appears solid; verification seldom fails (I haven't seen a failure yet), and there is reconciliation on failure.

  • The data-store is split up and published as "topics" (i.e. task_proxies, jobs, workflow, etc.), and these topic messages/deltas/updates are subscribed to, received, processed, and verified independently. This means the update frequency can differ between topics, while order and content are orchestrated by the workflow service (WFS).
  • The subscription is started first (in a separate thread), and messages wait in the ZMQ queue for the initial data request (which waits ~3 sec, then moves on to the next and waits again).
  • The entire workflow is requested in the main thread.
  • The subscription message/delta handler then works through the queue (via recv_multi; ZeroMQ holds messages up to the High Water Mark (HWM)), applying only deltas created after the initial data and, thereafter, after the last applied delta (see the sketch after this list).
  • The local data-store topic elements (i.e. task_proxies) are verified against the respective WFS via a checksum sent with each delta, and a new set is requested on failure.

(Deltas contain only the elements and fields that were updated.)
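As an illustration of the flow above, here is a minimal sketch of filtering a delta by creation time, applying it, and verifying against the checksum. The helper names (apply_delta, generate_checksum, request_topic_data) are hypothetical placeholders, not the actual data_mgr.py code:

    # Minimal illustrative sketch only; helper names are hypothetical.
    def handle_delta(store, topic, delta):
        """Apply one published topic delta to the local data-store."""
        # Ignore deltas created before the initial data / the last applied delta.
        if delta.time < store['delta_times'][topic]:
            return
        apply_delta(topic, delta, store)
        store['delta_times'][topic] = delta.time
        # Verify local topic elements against the checksum sent with the delta;
        # on mismatch, request a full new set of elements for this topic.
        if delta.checksum != generate_checksum(store[topic]):
            store[topic] = request_topic_data(topic)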

We need to figure out whether creating a thread for each subscription loop (i.e. one per workflow) is scalable. As in, do they yield execution to other threads while awaiting a publish? I did it this way to avoid blocking the main asyncio loop with subscription loops. Sockets can't be passed between threads, but I think it's fine for them to live out their lifetime on one thread (we'll see what happens with a large load of workflows).
Perhaps if we could dynamically update (add/remove) an asyncio.gather (or something similar), then we could avoid threads? (A rough sketch of that idea follows.)
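For comparison, one thread-free option would be to give each workflow subscription its own asyncio task using zmq.asyncio, so recv awaits on the main loop instead of blocking it. A rough sketch under that assumption (not what this PR implements):

    # Rough sketch only: one asyncio task per workflow subscription.
    import zmq
    import zmq.asyncio

    async def subscribe(ctx: zmq.asyncio.Context, endpoint, topics, handler):
        sock = ctx.socket(zmq.SUB)
        sock.connect(endpoint)
        for topic in topics:
            sock.setsockopt(zmq.SUBSCRIBE, topic)
        try:
            while True:
                # awaits without blocking the main asyncio loop
                topic, msg = await sock.recv_multipart()
                handler(topic, msg)
        finally:
            sock.close()

    # Tasks could then be added/removed as workflows appear/disappear:
    #     task = asyncio.create_task(subscribe(ctx, endpoint, [b'task_proxies'], handler))
    #     ...
    #     task.cancel()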

But for now it's fully functional! 🎉

Requirements check-list

  • I have read CONTRIBUTING.md and added my name as a Code Contributor.
  • Contains logically grouped changes.
  • Does not contain off-topic changes (use other PRs for other changes).
  • Appropriate tests are included (unit and/or functional).
  • Appropriate change log entry included.
  • No documentation update required.

@dwsutherland dwsutherland added this to the 0.2 milestone Oct 18, 2019
@dwsutherland dwsutherland self-assigned this Oct 18, 2019
@kinow (Member) left a comment

@dwsutherland so far all good, I just need a bit more time with data_mgr.py to understand how it is going to work (might need it for subscriptions I guess). So will finish the review either tonight or tomorrow 👍

(Resolved review threads on cylc/uiserver/workflows_mgr.py and cylc/uiserver/data_mgr.py, outdated)
@dwsutherland (Member, Author)

@dwsutherland so far all good, I just need a bit more time with data_mgr.py to understand how it is going to work (might need it for subscriptions I guess). So will finish the review either tonight or tomorrow 👍

There aren't a lot of comments or docstrings in this one yet; hopefully I'll push those tonight.

@kinow (Member) commented Nov 12, 2019

Thanks for the code comments you just added @dwsutherland ! Had a quick read, but will read it with ☕ tomorrow.

@dwsutherland (Member, Author) commented Nov 12, 2019

Thanks for the code comments you just added @dwsutherland ! Had a quick read, but will read it with ☕️ tomorrow.

I have no idea how to unit test the data sync... we would have to recreate a workflow (like we do in cylc-flow unit tests), or create a lot of fake data (which would be annoying to keep updated) and publish it...

@kinow (Member) commented Nov 12, 2019

Thanks for the code comments you just added @dwsutherland ! Had a quick read, but will read it with coffee tomorrow.

I have no idea how to unit test the data sync... would have to recreate a workflow, or create a lot of fake data (which would be annoying to update) and do a publish of this...

This one might be easier to leave with no test for now, or wait until we add either some sort of functional tests here or a cross-project functional test (I think we have an issue somewhere).

@kinow (Member) left a comment

Thanks @dwsutherland, the comments are really helpful. Finished reading the rest of the code and didn't find anything wrong. Tested with Cylc UI running two workflows and had no issues either. +1

Reconciliation on failed verification is done by requesting all elements of a
topic, and replacing the respective data-store elements with this.

Subscriptions are currently run in a different thread (via ThreadPoolExecutor).
Member comment:

👏 IMO comments like this make maintaining code so much easier! Thanks!!!!
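For context, the "different thread (via ThreadPoolExecutor)" approach that docstring mentions looks roughly like the following sketch; the names are illustrative only, not the actual data_mgr.py code:

    # Illustrative sketch: run a blocking ZMQ SUB loop in a worker thread so it
    # doesn't block the main asyncio loop; the socket lives its whole life there.
    import asyncio
    from concurrent.futures import ThreadPoolExecutor

    executor = ThreadPoolExecutor()

    def subscribe_loop(endpoint, topics, handler):
        # blocking recv loop; create/connect/close the socket in this thread only
        ...

    async def start_subscription(endpoint, topics, handler):
        loop = asyncio.get_running_loop()
        await loop.run_in_executor(
            executor, subscribe_loop, endpoint, topics, handler)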

@kinow (Member) commented Nov 12, 2019

ps: I think Travis will keep failing until cylc-flow's PR is merged: E ImportError: cannot import name 'ZMQSocketBase' from 'cylc.flow.network' (/home/travis/virtualenv/python3.7.1/lib/python3.7/site-packages/cylc/flow/network/__init__.py)

@kinow (Member) commented Nov 12, 2019

Oh, actually had something new in the console with this branch:

zmq.error.ZMQError: Too many open files
2019-11-13 12:32:47,467 tornado.application ERROR    Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOMainLoop object at 0x7f490bece080>>, <Task finished coro=<WorkflowsManager.gather_workflows() done, defined at /home/kinow/Development/python/workspace/cylc-uiserver/cylc/uiserver/workflows_mgr.py:95> exception=ZMQError('Too many open files')>)
Traceback (most recent call last):
  File "/home/kinow/Development/python/workspace/cylc-uiserver/venv/lib/python3.7/site-packages/tornado/ioloop.py", line 743, in _run_callback
    ret = callback()
  File "/home/kinow/Development/python/workspace/cylc-uiserver/venv/lib/python3.7/site-packages/tornado/ioloop.py", line 767, in _discard_future_result
    future.result()
  File "/home/kinow/Development/python/workspace/cylc-uiserver/cylc/uiserver/workflows_mgr.py", line 105, in gather_workflows
    items = await asyncio.gather(*gathers)
  File "/home/kinow/Development/python/workspace/cylc-uiserver/cylc/uiserver/workflows_mgr.py", line 77, in est_workflow
    context=context, timeout=timeout)
  File "/home/kinow/Development/python/workspace/cylc-flow/cylc/flow/network/client.py", line 129, in __init__
    self.start(host, port)
  File "/home/kinow/Development/python/workspace/cylc-flow/cylc/flow/network/__init__.py", line 91, in start
    self._start_sequence(*args, **kwargs)
  File "/home/kinow/Development/python/workspace/cylc-flow/cylc/flow/network/__init__.py", line 105, in _start_sequence
    self._socket_connect(*args, **kwargs)
  File "/home/kinow/Development/python/workspace/cylc-flow/cylc/flow/network/__init__.py", line 137, in _socket_connect
    self.socket = self.context.socket(self.pattern)
  File "/home/kinow/Development/python/workspace/cylc-uiserver/venv/lib/python3.7/site-packages/zmq/sugar/context.py", line 146, in socket
    s = self._socket_class(self, socket_type, **kwargs)
  File "/home/kinow/Development/python/workspace/cylc-uiserver/venv/lib/python3.7/site-packages/zmq/_future.py", line 134, in __init__
    super(_AsyncSocket, self).__init__(context, socket_type, **kwargs)
  File "/home/kinow/Development/python/workspace/cylc-uiserver/venv/lib/python3.7/site-packages/zmq/sugar/socket.py", line 59, in __init__
    super(Socket, self).__init__(*a, **kw)
  File "zmq/backend/cython/socket.pyx", line 328, in zmq.backend.cython.socket.Socket.__init__
zmq.error.ZMQError: Too many open files

I do have a few too many things running in parallel, but no different from what I normally have every day. I don't remember seeing a "Too many open files" problem before. Will take a look at my OS limits.

@kinow (Member) commented Nov 12, 2019

$ ulimit
unlimited
$ ulimit -Hn
1048576
$ ulimit -n
1024
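For reference, the 1024 figure is the soft per-process limit; it can be raised up to the hard limit with ulimit -n, or from within a Python process. A minimal illustration (not something this PR changes):

    # Raise this process's soft open-file limit up to its hard limit.
    import resource

    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))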

@kinow (Member) commented Nov 12, 2019

Hmmm, restarted the UI Server (while the two workflows were still running), then went to run more benchmarking tests on Cylc UI, and noticed this later in the console:

[I 2019-11-13 12:55:05.620 JupyterHub log:174] 200 GET /hub/api/authorizations/token/[secret] (kinow@127.0.0.1) 34.16ms
2019-11-13 12:55:11,842 tornado.application ERROR    Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOMainLoop object at 0x7fd249f486a0>>, <Task finished coro=<WorkflowsManager.gather_workflows() done, defined at /home/kinow/Development/python/workspace/cylc-uiserver/cylc/uiserver/workflows_mgr.py:95> exception=TypeError("stop() got an unexpected keyword argument 'stop_loop'")>)
Traceback (most recent call last):
  File "/home/kinow/Development/python/workspace/cylc-uiserver/venv/lib/python3.7/site-packages/tornado/ioloop.py", line 743, in _run_callback
    ret = callback()
  File "/home/kinow/Development/python/workspace/cylc-uiserver/venv/lib/python3.7/site-packages/tornado/ioloop.py", line 767, in _discard_future_result
    future.result()
  File "/home/kinow/Development/python/workspace/cylc-uiserver/cylc/uiserver/workflows_mgr.py", line 126, in gather_workflows
    client.stop(stop_loop=False)
TypeError: stop() got an unexpected keyword argument 'stop_loop'

I guess I have to update my branch of the cylc-flow PR too...

@kinow (Member) commented Nov 13, 2019

Yup, all good now. No more errors so far 😬 sorry for the noise!

@dwsutherland (Member, Author)

Yup, all good now. No more errors so far 😬 sorry for the noise!

All good. Cylc does raise the open file count; I remember having to address this with Cylc7 in NIWA operations.

Thanks 👍

@kinow (Member) commented Nov 13, 2019

Yup, all good now. No more errors so far grimacing sorry for the noise!

All good. Cylc does raise the open file count, I remember having to address this with Cylc7 in NIWA operations..

Thanks +1

Makes sense. The only other thing I noticed is that stopping the hub takes a bit longer now. I assume it's because the UI Server is waiting for the thread or tasks to finish. But that's expected too I believe.

@dwsutherland (Member, Author)

Yup, all good now. No more errors so far grimacing sorry for the noise!

All good. Cylc does raise the open file count, I remember having to address this with Cylc7 in NIWA operations..
Thanks +1

Makes sense. The only other thing I noticed is that stopping the hub takes a bit longer now. I assume it's because the UI Server is waiting for the thread or tasks to finish. But that's expected too I believe.

Yes, correct. I think it's the thread join, but the amount of time seems to be consistent (there might be a way to daemonise the threads so we can exit more abruptly and quickly).

@dwsutherland (Member, Author) commented Nov 13, 2019

The UI Server now uses the published shutdown topic to stop the subscriber and signal the workflow manager to clear the workflow.

(As a result, the UIS shutdown is quicker if the workflows are shut down first.)
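A minimal sketch of what handling the shutdown topic could look like; the attribute and helper names are hypothetical, and the real wiring is in the data/workflows managers:

    # Sketch only: on a published 'shutdown' message, stop that workflow's
    # subscriber and flag the workflow for removal by the workflow manager.
    def on_message(self, topic, msg, w_id):
        if topic == b'shutdown':
            self.subscribers[w_id].stop()
            self.workflows_mgr.stopping.add(w_id)  # hypothetical hand-off
            return
        self.apply_topic_delta(topic, msg, w_id)   # hypothetical helper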

@oliver-sanders (Member) left a comment

Looks OK to me, a few comments, mostly for future work.

(Resolved review thread on cylc/uiserver/data_mgr.py, outdated)
"""
# wait until data-store is populated for this workflow
loop_cnt = 0
while loop_cnt < 5:
Member comment:

Another sleep loop to try and replace at a later date.

Member (Author) reply:

Maybe (?)... The subscription happens first, but it needs to wait around for the initial data payload from the REQ client... Publishes are queued until the subscriber receives them, and deltas earlier than the payload are ignored.
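If that sleep loop is revisited later, one possibility would be to wait on an asyncio.Event that is set once the initial data payload has populated the store; a sketch under that assumption:

    # Sketch only: replace a bounded sleep/retry loop with an asyncio.Event.
    import asyncio

    populated = asyncio.Event()

    async def wait_for_initial_data(timeout=15):
        try:
            await asyncio.wait_for(populated.wait(), timeout)
        except asyncio.TimeoutError:
            pass  # fall back to whatever the current retry loop does

    # ...and wherever the initial data payload is applied:
    #     populated.set()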

"""
if topic == WORKFLOW:
return
if topic == EDGES:
Member comment:

Why use id for edges and stamp for everything else?

@dwsutherland (Member, Author) replied Nov 27, 2019:

Because edges don't have dynamic fields (no plan to change that, yet!). Once created they don't change; a suite reload replaces existing edges, and they only get deleted thereafter. (I guess for consistency we could still do it the same way...)
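To illustrate the distinction, a purely hypothetical order-independent checksum over per-element keys (not cylc-flow's actual checksum code): mutable elements are keyed by their stamp (id plus last-update information), so a field change alters the checksum, while edges, being immutable once created, can be keyed by id alone:

    # Hypothetical illustration only.
    import hashlib

    def topic_checksum(keys):
        """Order-independent checksum over a collection of element keys."""
        return hashlib.md5('\n'.join(sorted(keys)).encode()).hexdigest()

    # task_proxies, jobs, ...: keys are element stamps (id + update time)
    # edges: immutable once created, so ids alone are sufficient keys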

apply_delta(topic, delta, self.data[w_id])
self.data[w_id]['delta_times'][topic] = delta_time
self.reconcile_update(topic, delta, w_id)

Member comment:

Can we catch inconsistencies early at this point in the code rather than waiting for the next reconcile_update?

else:
    rebuild_store()

Member (Author) reply:

Unless I missed your point: the elif delta_time >= self.data[w_id]['delta_times'][topic]: statement filters for valid deltas to apply, so you don't really want to do anything to the store with an invalid delta (one whose creation time is earlier than that of the latest applied delta).

Member comment:

I was wondering about the impact of out of order messages and whether there is the potential for the data store to become corrupted but for us not to notice until the next message comes in.

I think you've got this covered with the topic/checksum system so probably irrelevant:

  • If the workflow sends messages: 1, 2, 3
  • And the UI Server receives messages: 1, 3, 2
  • Then we know something is out of whack; we don't need to wait for message 4 to find that out.

I think this system should pick up on the error, but if messages 1, 2 and 3 relate to different topics, is there the potential for anything to fall through the cracks?

@dwsutherland (Member, Author) replied Nov 28, 2019:

I'm not sure the ZMQ message queue can be out of order; if you hit recv on the subscriber you'd just get the oldest message first (post-connect), unless the HWM has been hit, in which case you'd miss any pruned messages.

Different topics each hit their own respective part of the data store. There's no confusion of topics, as the topic and its data arrive together.

(Resolved review threads on cylc/uiserver/data_mgr.py, outdated)
@codecov-io

Codecov Report

Merging #95 into master will decrease coverage by 6.6%.
The diff coverage is 38.46%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #95      +/-   ##
==========================================
- Coverage   55.42%   48.81%   -6.61%     
==========================================
  Files           6        6              
  Lines         258      338      +80     
  Branches       38       57      +19     
==========================================
+ Hits          143      165      +22     
- Misses        113      170      +57     
- Partials        2        3       +1
Impacted Files Coverage Δ
cylc/uiserver/data_store_mgr.py 28.84% <28.84%> (ø)
cylc/uiserver/resolvers.py 44% <50%> (ø) ⬆️
cylc/uiserver/main.py 66.07% <53.84%> (-3.16%) ⬇️
cylc/uiserver/workflows_mgr.py 48.45% <60%> (-1.55%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 8526e67...251e0df.

@hjoliver (Member) left a comment

LGTM. Tested (lots) as working. Discussed sync verification (again) with @dwsutherland offline (Riot.im).

@hjoliver (Member) commented Dec 3, 2019

I'll merge this for compatibility with the back end; @oliver-sanders' comments above suggest he has as good as approved it.

@hjoliver hjoliver merged commit ed4574b into cylc:master Dec 3, 2019