Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Several "ERROR:asyncio:Future exception was never retrieved" errors in log #65

Closed
kinow opened this issue Aug 19, 2019 · 10 comments
Closed
Labels
bug Something isn't working

Comments

@kinow
Copy link
Member

kinow commented Aug 19, 2019

Hi,

I have been running a modified workflow five with the following code:

[meta]
    title = "Inter-cycle dependence + a cold-start task"
[cylc]
    UTC mode = True
[scheduling]
    #runahead limit = 120
    initial cycle point = 20130808T00
    final cycle point = 20190812T00
    [[dependencies]]
        [[[R1]]]
            graph = "prep => foo"
        [[[PT12H]]]
            graph = "foo[-PT12H] => foo => bar"
[runtime]
[[root]]
script = "sleep 2; echo 'done!'"

I've modified it just so it would take a bit longer to finish, and give me more time to work on Vue.js code for Cylc UI, with real data.

The workflow has been running since this morning (around 10AM I think?), and around 1 PM I noticed that the UI would have the workflow tree - in a new component - disappear completely.

Checking the logs, Cylc Flow appears to be OK, as well as Cylc UI. But Cylc UI Server has the logs flooded with:

ERROR:asyncio:Future exception was never retrieved
future: <Future finished exception=ZMQError('Operation cannot be accomplished in current state')>
Traceback (most recent call last):
  File "/home/kinow/Development/python/workspace/cylc-uiserver/venv/lib/python3.7/site-packages/zmq/_future.py", line 363, in _add_send_event
    r = send(msg, **nowait_kwargs)
  File "/home/kinow/Development/python/workspace/cylc-uiserver/venv/lib/python3.7/site-packages/zmq/sugar/socket.py", line 395, in send
    return super(Socket, self).send(data, flags=flags, copy=copy, track=track)
  File "zmq/backend/cython/socket.pyx", line 733, in zmq.backend.cython.socket.Socket.send
  File "zmq/backend/cython/socket.pyx", line 780, in zmq.backend.cython.socket.Socket.send
  File "zmq/backend/cython/socket.pyx", line 247, in zmq.backend.cython.socket._send_copy
  File "zmq/backend/cython/socket.pyx", line 242, in zmq.backend.cython.socket._send_copy
  File "zmq/backend/cython/checkrc.pxd", line 25, in zmq.backend.cython.checkrc._check_rc
zmq.error.ZMQError: Operation cannot be accomplished in current state

Haven't investigated any further as I'm still working on some JS issues with that component. I am using master on Cylc UI Server and on Cylc Flow. And a custom branch on Cylc UI (tree-component-1).

Cheers
Bruno

@kinow
Copy link
Member Author

kinow commented Aug 19, 2019

Screenshot_2019-08-19_15-06-10

@hjoliver
Copy link
Member

I have also seen this.

@kinow
Copy link
Member Author

kinow commented Aug 19, 2019

From what I found in some 15 minutes googling, it appears to be something with the client. Could be a deadlock, it could be the order of req-rep (had to google dealer concept for zmq), or could even be something like an existing bug. Hopefully it's something easy to identify the bug and fix it.

@kinow
Copy link
Member Author

kinow commented Aug 20, 2019

In the morning I executed the suite several times, but killing it before it completed 20 minutes running (was developing a small feature for the Tree component).

And I noted the component never blinked, always displayed correctly.

In the afternoon I left the workflow five running from 1PM to 4:15PM, and near 4PM when I checked the logs (was reading on the Windows computer), the cylc-uiserver log had these messages again.

Furthermore, the UI Tree component started to "blink" again. And I noticed that the GraphQL server was returning [] for the queries every now and then (probably the reason for the component to disappear for a short time).

@kinow
Copy link
Member Author

kinow commented Aug 28, 2019

Found an easier way to reproduce it.

  1. Build Cylc UI (npm run build)
  2. Start Cylc UI Server (jupyterhub after having installed all dependencies, etc)
  3. Start a workflow (e.g. cylc run --no-detach --verbose --debug five)
  4. At this point, nothing wrong happens. I left my notebook running a workflow for 5 hours, and nada
  5. So to reproduce the issue:
  6. Eventually you will get the exception. I managed to close everything, restart my environment, and then reproduced the exception in under 1 minute 🎉

@kinow
Copy link
Member Author

kinow commented Aug 28, 2019

Bonus: you can add do the following to troubleshoot asyncio with a JupyterHub application:

  1. export PYTHONASYNCIODEBUG=1 - this will enable asyncio debug for the Hub... It is important to know that the Hub filters what environment variables are inherited by its applications (for security)
  2. Add the following to your jupyterhub_config.py: c.Spawner.env_keep = ['PATH', 'PYTHONPATH', 'CONDA_ROOT', 'CONDA_DEFAULT_ENV', 'VIRTUAL_ENV', 'LANG', 'LC_ALL', 'PYTHONASYNCIODEBUG'] - now cylc-uiserver will inherit all these variables, including PYTHONASYNCIODEBUG

Thanks to @hjoliver for the suggestion. Here's the traceback:

WARNING:asyncio:Executing <Task finished coro=<get_workflow_data() done, defined at /home/kinow/Development/python/workspace/cylc-uiserver/cylc/uiserver/data_mgr.py:28> result=('kinow/five', 'TIMEOUT') created at /home/kinow/Development/python/anaconda3/lib/python3.7/asyncio/tasks.py:719> took 2.003 seconds
WARNING:asyncio:Executing <Task finished coro=<get_workflow_data() done, defined at /home/kinow/Development/python/workspace/cylc-uiserver/cylc/uiserver/data_mgr.py:28> result=('kinow/five', 'TIMEOUT') created at /home/kinow/Development/python/anaconda3/lib/python3.7/asyncio/tasks.py:719> took 2.004 seconds
ERROR:asyncio:Future exception was never retrieved
future: <Future finished exception=ZMQError('Operation cannot be accomplished in current state') created at /home/kinow/Development/python/workspace/cylc-uiserver/venv/lib/python3.7/site-packages/zmq/_future.py:345>
source_traceback: Object created at (most recent call last):
  File "/home/kinow/Development/python/workspace/cylc-uiserver/venv/bin/cylc-uiserver", line 11, in <module>
    load_entry_point('cylc-uiserver', 'console_scripts', 'cylc-uiserver')()
  File "/home/kinow/Development/python/workspace/cylc-uiserver/cylc/uiserver/main.py", line 158, in main
    ui_server.start()
  File "/home/kinow/Development/python/workspace/cylc-uiserver/cylc/uiserver/main.py", line 131, in start
    ioloop.IOLoop.current().start()
  File "/home/kinow/Development/python/workspace/cylc-uiserver/venv/lib/python3.7/site-packages/tornado/platform/asyncio.py", line 148, in start
    self.asyncio_loop.run_forever()
  File "/home/kinow/Development/python/anaconda3/lib/python3.7/asyncio/base_events.py", line 539, in run_forever
    self._run_once()
  File "/home/kinow/Development/python/anaconda3/lib/python3.7/asyncio/base_events.py", line 1767, in _run_once
    handle._run()
  File "/home/kinow/Development/python/anaconda3/lib/python3.7/asyncio/events.py", line 88, in _run
    self._context.run(self._callback, *self._args)
  File "/home/kinow/Development/python/workspace/cylc-uiserver/cylc/uiserver/data_mgr.py", line 31, in get_workflow_data
    pb_msg = await client.async_request(method)
  File "/home/kinow/Development/python/workspace/cylc-uiserver/venv/lib/python3.7/site-packages/cylc/flow/network/client.py", line 142, in async_request
    self.socket.send_string(message)
  File "/home/kinow/Development/python/workspace/cylc-uiserver/venv/lib/python3.7/site-packages/zmq/sugar/socket.py", line 563, in send_string
    return self.send(u.encode(encoding), flags=flags, copy=copy, **kwargs)
  File "/home/kinow/Development/python/workspace/cylc-uiserver/venv/lib/python3.7/site-packages/zmq/_future.py", line 209, in send
    return self._add_send_event('send', msg=msg, kwargs=kwargs)
  File "/home/kinow/Development/python/workspace/cylc-uiserver/venv/lib/python3.7/site-packages/zmq/_future.py", line 345, in _add_send_event
    f = future or self._Future()
Traceback (most recent call last):
  File "/home/kinow/Development/python/workspace/cylc-uiserver/venv/lib/python3.7/site-packages/zmq/_future.py", line 363, in _add_send_event
    r = send(msg, **nowait_kwargs)
  File "/home/kinow/Development/python/workspace/cylc-uiserver/venv/lib/python3.7/site-packages/zmq/sugar/socket.py", line 395, in send
    return super(Socket, self).send(data, flags=flags, copy=copy, track=track)
  File "zmq/backend/cython/socket.pyx", line 733, in zmq.backend.cython.socket.Socket.send
  File "zmq/backend/cython/socket.pyx", line 780, in zmq.backend.cython.socket.Socket.send
  File "zmq/backend/cython/socket.pyx", line 247, in zmq.backend.cython.socket._send_copy
  File "zmq/backend/cython/socket.pyx", line 242, in zmq.backend.cython.socket._send_copy
  File "zmq/backend/cython/checkrc.pxd", line 25, in zmq.backend.cython.checkrc._check_rc
zmq.error.ZMQError: Operation cannot be accomplished in current state
WARNING:asyncio:Executing <Task finished coro=<get_workflow_data() done, defined at /home/kinow/Development/python/workspace/cylc-uiserver/cylc/uiserver/data_mgr.py:28> result=('kinow/five', None) created at /home/kinow/Development/python/anaconda3/lib/python3.7/asyncio/tasks.py:719> took 2.003 seconds

(I left the WARNING entries... these appear when a coroutine stops, so we can see what stopped running in the main loop.. these popped up every time I stopped the workflow and then restarted it... in this case it was quite fast to reproduce the error)

I think this gives a good starting point to @dwsutherland or someone else who knows how it works to investigate the issue, in special this line I think.

  File "/home/kinow/Development/python/workspace/cylc-uiserver/cylc/uiserver/data_mgr.py", line 31, in get_workflow_data
    pb_msg = await client.async_request(method)

@kinow
Copy link
Member Author

kinow commented Aug 28, 2019

(now back to an ugly bug with WebPack in Cylc UI 😢 so I am not working on this right now 👍 )

@dwsutherland
Copy link
Member

@kinow - Thanks for the investigation. I think it won't be an issue once the data-store sync is in, as it appears to be a problem with the 5sec PeriodicCallback data-store update.

I might run some timers on instantiating the clients on request (i.e. like Tornado does with the handlers)...

Perhaps the root cause might be:

  • A timeout of one REQ, which returns None (hence the no data, although I would think we would see the timeout message), then the next REQ happens while it's still in lock..

I think this is likely because of the workflow stop and start to reproduce the issue..

A fix could be to catch this error and ignore (don't update the data) with the data manager request, and let the workflow manager clean up the old suite on it's 10sec Periodic callback.

@kinow kinow added the bug Something isn't working label Aug 29, 2019
@kinow kinow added this to the 0.1 milestone Sep 10, 2019
@kinow kinow modified the milestones: 0.1, 0.2 Sep 18, 2019
@kinow
Copy link
Member Author

kinow commented Sep 18, 2019

Moving to 0.2 as we won't be able to investigate it on time for 0.1, and is not a blocker IMO.

@kinow
Copy link
Member Author

kinow commented Dec 13, 2020

Just tried it again, after re-enabling the configuration settings in JuptyerHub configuration. Glad to announce this is not happening any longer. Closing (can be searched and linked or re-opened in case the issue occurs again)

@kinow kinow closed this as completed Dec 13, 2020
@kinow kinow modified the milestones: 1.0, 0.3 Dec 13, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants