BLRunner stuck on "OSError: Timed out trying to connect to 'inproc://172.17.0.2/10/1'" while running arboreto #48

Closed
Oakento opened this issue Feb 15, 2021 · 2 comments

Comments

@Oakento

Oakento commented Feb 15, 2021

Hi,
I was trying to run grnbeeline/arboreto:base through BLRunner.py using the following command:

docker run --rm -v /home/abc/projects/Beeline:/data/ --expose=41269 grnbeeline/arboreto:base /bin/sh -c "time -v -o data/outputs/Synthetic/dyn-LI/dyn-LI-100-1/GENIE3/time.txt python runArboreto.py --algo=GENIE3 --inFile=data/inputs/Synthetic/dyn-LI/dyn-LI-100-1/GENIE3/ExpressionData.csv --outFile=data/outputs/Synthetic/dyn-LI/dyn-LI-100-1/GENIE3/outFile.txt "

However, an error occurred and the program got stuck.

Task exception was never retrieved
future: <Task finished coro=<connect.<locals>._() done, defined at /opt/conda/lib/python3.7/site-packages/distributed/comm/core.py:288> exception=CommClosedError()>
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/distributed/comm/core.py", line 297, in _
    handshake = await asyncio.wait_for(comm.read(), 1)
  File "/opt/conda/lib/python3.7/asyncio/tasks.py", line 435, in wait_for
    await waiter
concurrent.futures._base.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/distributed/comm/core.py", line 304, in _
    raise CommClosedError() from e
distributed.comm.core.CommClosedError
tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOLoop object at 0x7f4a22da7250>>, <Task finished coro=<SpecCluster._correct_state_internal() done, defined at /opt/conda/lib/python3.7/site-packages/distributed/deploy/spec.py:320> exception=OSError("Timed out trying to connect to 'inproc://172.17.0.2/10/1' after 10 s: Timed out trying to connect to 'inproc://172.17.0.2/10/1' after 10 s: connect() didn't finish in time")>)
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/distributed/comm/core.py", line 322, in connect
    _raise(error)
  File "/opt/conda/lib/python3.7/site-packages/distributed/comm/core.py", line 275, in _raise
    raise IOError(msg)
OSError: Timed out trying to connect to 'inproc://172.17.0.2/10/1' after 10 s: connect() didn't finish in time

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/tornado/ioloop.py", line 743, in _run_callback
    ret = callback()
  File "/opt/conda/lib/python3.7/site-packages/tornado/ioloop.py", line 767, in _discard_future_result
    future.result()
  File "/opt/conda/lib/python3.7/site-packages/distributed/deploy/spec.py", line 401, in _close
    await self._correct_state()
  File "/opt/conda/lib/python3.7/site-packages/distributed/deploy/spec.py", line 328, in _correct_state_internal
    await self.scheduler_comm.retire_workers(workers=list(to_close))
  File "/opt/conda/lib/python3.7/site-packages/distributed/core.py", line 810, in send_recv_from_rpc
    comm = await self.live_comm()
  File "/opt/conda/lib/python3.7/site-packages/distributed/core.py", line 772, in live_comm
    **self.connection_args,
  File "/opt/conda/lib/python3.7/site-packages/distributed/comm/core.py", line 334, in connect
    _raise(error)
  File "/opt/conda/lib/python3.7/site-packages/distributed/comm/core.py", line 275, in _raise
    raise IOError(msg)
OSError: Timed out trying to connect to 'inproc://172.17.0.2/10/1' after 10 s: Timed out trying to connect to 'inproc://172.17.0.2/10/1' after 10 s: connect() didn't finish in time

The error is intermittent: across multiple attempts, it shows up at different points in the run.

Additionally, the containers are running under Docker's bridge network.

@adyprat
Collaborator

adyprat commented Feb 15, 2021

Hi,
The program should keep running despite the "tornado application error" inside the Docker container. According to the authors of Arboreto, those errors can be ignored (aertslab/arboreto#10).
As long as the container is running, the algorithm should be running. I'm assuming you are trying to run GENIE3 on a large-ish dataset (thousands of genes?), which will take a while to complete. If the container exits without producing any output, let me know.
Best,
Aditya
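
One way to tell a harmless error from a failed run, following the advice above, is to check for the output file once the container has exited. The snippet below is only a sketch: `output_was_written` is a hypothetical helper, and the host-side path assumes that the relative `data/outputs/...` path in the command resolves to the mounted `/data/` directory, which the thread does not confirm.

```python
from pathlib import Path


def output_was_written(out_file):
    """Return True if the result file exists and is not empty."""
    path = Path(out_file)
    return path.is_file() and path.stat().st_size > 0


# Host-side location of outFile.txt (assumption: /home/abc/projects/Beeline
# is mounted at /data/ and the script runs from the container's root).
out_file = (
    "/home/abc/projects/Beeline/outputs/Synthetic/dyn-LI/"
    "dyn-LI-100-1/GENIE3/outFile.txt"
)

if output_was_written(out_file):
    print("Output found: the tornado errors can probably be ignored.")
else:
    print("No output yet: the run is still in progress or it failed.")
```

An empty or missing file after the container has exited is the case the maintainer asked to hear about.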

@smartpig-666

I encountered the same problem. How did you solve it in the end?

`distributed.comm.inproc - WARNING - Closing dangling queue in
Traceback (most recent call last):
File "runArboreto.py", line 43, in
main(sys.argv)
File "runArboreto.py", line 32, in main
network = genie3(inDF.to_numpy(), client_or_address = client, gene_names = inDF.columns)
File "/opt/conda/lib/python3.7/site-packages/arboreto/algo.py", line 73, in genie3
limit=limit, seed=seed, verbose=verbose)
File "/opt/conda/lib/python3.7/site-packages/arboreto/algo.py", line 135, in diy
.compute(graph, sync=True)
File "/opt/conda/lib/python3.7/site-packages/distributed/client.py", line 2919, in compute
result = self.gather(futures)
File "/opt/conda/lib/python3.7/site-packages/distributed/client.py", line 1993, in gather
asynchronous=asynchronous,
File "/opt/conda/lib/python3.7/site-packages/distributed/client.py", line 834, in sync
self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
File "/opt/conda/lib/python3.7/site-packages/distributed/utils.py", line 339, in sync
raise exc.with_traceback(tb)
File "/opt/conda/lib/python3.7/site-packages/distributed/utils.py", line 323, in f
result[0] = yield future
File "/opt/conda/lib/python3.7/site-packages/tornado/gen.py", line 735, in run
value = future.result()
concurrent.futures._base.CancelledError
tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOLoop object at 0x7fb07674df90>>, <Task finished coro=<SpecCluster._correct_state_internal() done, defined at /opt/conda/lib/python3.7/site-packages/distributed/deploy/spec.py:320> exception=OSError("Timed out trying to connect to 'inproc://172.17.0.2/9/1' after 10 s: Timed out trying to connect to 'inproc://172.17.0.2/9/1' after 10 s: connect() didn't finish in time")>)
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/site-packages/distributed/comm/core.py", line 322, in connect
_raise(error)
File "/opt/conda/lib/python3.7/site-packages/distributed/comm/core.py", line 275, in _raise
raise IOError(msg)
OSError: Timed out trying to connect to 'inproc://172.17.0.2/9/1' after 10 s: connect() didn't finish in time

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/opt/conda/lib/python3.7/site-packages/tornado/ioloop.py", line 743, in _run_callback
ret = callback()
File "/opt/conda/lib/python3.7/site-packages/tornado/ioloop.py", line 767, in _discard_future_result
future.result()
File "/opt/conda/lib/python3.7/site-packages/distributed/deploy/spec.py", line 401, in _close
await self._correct_state()
File "/opt/conda/lib/python3.7/site-packages/distributed/deploy/spec.py", line 328, in _correct_state_internal
await self.scheduler_comm.retire_workers(workers=list(to_close))
File "/opt/conda/lib/python3.7/site-packages/distributed/core.py", line 810, in send_recv_from_rpc
comm = await self.live_comm()
File "/opt/conda/lib/python3.7/site-packages/distributed/core.py", line 772, in live_comm
**self.connection_args,
File "/opt/conda/lib/python3.7/site-packages/distributed/comm/core.py", line 334, in connect
_raise(error)
File "/opt/conda/lib/python3.7/site-packages/distributed/comm/core.py", line 275, in _raise
raise IOError(msg)
OSError: Timed out trying to connect to 'inproc://172.17.0.2/9/1' after 10 s: Timed out trying to connect to 'inproc://172.17.0.2/9/1' after 10 s: connect() didn't finish in time
`
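
The thread does not record a confirmed fix. One thing that might be worth trying is a longer Dask connect timeout, since the messages above report giving up "after 10 s". The sketch below builds the same kind of `docker run` command with the timeout raised through an environment variable. Dask reads `DASK_`-prefixed variables as configuration, and `distributed.comm.timeouts.connect` is the relevant key, but whether the distributed version in this image honors it on this code path, and whether it helps at all, are assumptions. `docker_run_args` is a hypothetical helper, and `60s` is an arbitrary value.

```python
import shlex

# Dask maps DASK_-prefixed environment variables onto its configuration,
# with a double underscore marking each level of nesting.
CONNECT_TIMEOUT_VAR = "DASK_DISTRIBUTED__COMM__TIMEOUTS__CONNECT"


def docker_run_args(host_dir, image, inner_command, connect_timeout="60s"):
    """Build a `docker run` argument list that sets a longer connect timeout."""
    return [
        "docker", "run", "--rm",
        "-v", f"{host_dir}:/data/",
        "-e", f"{CONNECT_TIMEOUT_VAR}={connect_timeout}",
        image,
        "/bin/sh", "-c", inner_command,
    ]


args = docker_run_args(
    host_dir="/home/abc/projects/Beeline",
    image="grnbeeline/arboreto:base",
    inner_command=(
        "python runArboreto.py --algo=GENIE3 "
        "--inFile=data/inputs/Synthetic/dyn-LI/dyn-LI-100-1/GENIE3/ExpressionData.csv "
        "--outFile=data/outputs/Synthetic/dyn-LI/dyn-LI-100-1/GENIE3/outFile.txt"
    ),
)

# Print a shell-quoted command line that can be pasted into a terminal.
print(shlex.join(args))
```

Passing the setting through the environment leaves `runArboreto.py` and the image untouched, so it is easy to drop again if it makes no difference.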
