dask-mpi not working #30

Closed
andersy005 opened this issue May 2, 2019 · 6 comments

@andersy005 (Member) commented May 2, 2019

I have the following script, dask_mpi_test.py:

from dask_mpi import initialize
initialize()

from distributed import Client 
import dask
client = Client()

df = dask.datasets.timeseries()

print(df.groupby(['time', 'name']).mean().compute())
print(client)

When I try to run this script with:

mpirun -np 4 python dask_mpi_test.py

I get these errors:
~/workdir $ mpirun -np 4 python dask_mpi_test.py
distributed.scheduler - INFO - Clear task state
distributed.scheduler - INFO -   Scheduler at: tcp://xxxxxx:8786
distributed.scheduler - INFO -       bokeh at:                     :8787
distributed.worker - INFO -       Start worker at: tcp://xxxxx:44712
/glade/work/abanihi/softwares/miniconda3/envs/analysis/lib/python3.7/site-packages/distributed/bokeh/core.py:57: UserWarning:
Port 8789 is already in use.
Perhaps you already have a cluster running?
Hosting the diagnostics dashboard on a random port instead.
  warnings.warn('\n' + msg)
distributed.worker - INFO -       Start worker at: tcp://xxxxxx:36782
distributed.worker - INFO -          Listening to:               tcp://:44712
distributed.worker - INFO -              bokeh at:                      :8789
distributed.worker - INFO -          Listening to:               tcp://:36782
distributed.worker - INFO - Waiting to connect to: tcp://xxxxxx:8786
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -              bokeh at:                     :43876
distributed.worker - INFO - Waiting to connect to: tcp://xxxxx:8786
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          1
distributed.worker - INFO -               Threads:                          1
distributed.worker - INFO -                Memory:                    3.76 GB
distributed.worker - INFO -                Memory:                    3.76 GB
distributed.worker - INFO -       Local Directory: /gpfs/fs1/scratch/abanihi/worker-uoz0vtci
distributed.worker - INFO -       Local Directory: /gpfs/fs1/scratch/abanihi/worker-bb0u_737
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - -------------------------------------------------
Traceback (most recent call last):
  File "dask_mpi_test.py", line 6, in <module>
    client = Client()
  File "/glade/work/abanihi/softwares/miniconda3/envs/analysis/lib/python3.7/site-packages/distributed/client.py", line 640, in __init__
    self.start(timeout=timeout)
  File "/glade/work/abanihi/softwares/miniconda3/envs/analysis/lib/python3.7/site-packages/distributed/client.py", line 763, in start
    sync(self.loop, self._start, **kwargs)
  File "/glade/work/abanihi/softwares/miniconda3/envs/analysis/lib/python3.7/site-packages/distributed/utils.py", line 321, in sync
    six.reraise(*error[0])
  File "/glade/work/abanihi/softwares/miniconda3/envs/analysis/lib/python3.7/site-packages/six.py", line 693, in reraise
    raise value
  File "/glade/work/abanihi/softwares/miniconda3/envs/analysis/lib/python3.7/site-packages/distributed/utils.py", line 306, in f
    result[0] = yield future
  File "/glade/work/abanihi/softwares/miniconda3/envs/analysis/lib/python3.7/site-packages/tornado/gen.py", line 1133, in run
    value = future.result()
  File "/glade/work/abanihi/softwares/miniconda3/envs/analysis/lib/python3.7/site-packages/tornado/gen.py", line 1141, in run
    yielded = self.gen.throw(*exc_info)
  File "/glade/work/abanihi/softwares/miniconda3/envs/analysis/lib/python3.7/site-packages/distributed/client.py", line 851, in _start
    yield self._ensure_connected(timeout=timeout)
  File "/glade/work/abanihi/softwares/miniconda3/envs/analysis/lib/python3.7/site-packages/tornado/gen.py", line 1133, in run
    value = future.result()
  File "/glade/work/abanihi/softwares/miniconda3/envs/analysis/lib/python3.7/site-packages/tornado/gen.py", line 1141, in run
    yielded = self.gen.throw(*exc_info)
  File "/glade/work/abanihi/softwares/miniconda3/envs/analysis/lib/python3.7/site-packages/distributed/client.py", line 892, in _ensure_connected
    self._update_scheduler_info())
  File "/glade/work/abanihi/softwares/miniconda3/envs/analysis/lib/python3.7/site-packages/tornado/gen.py", line 1133, in run
    value = future.result()
tornado.util.Timeout
$ conda list dask
# packages in environment at /glade/work/abanihi/softwares/miniconda3/envs/analysis:
#
# Name                    Version                   Build  Channel
dask                      1.2.0                      py_0    conda-forge
dask-core                 1.2.0                      py_0    conda-forge
dask-jobqueue             0.4.1+28.g5826abe          pypi_0    pypi
dask-labextension         0.3.3                    pypi_0    pypi
dask-mpi                  1.0.2                    py37_0    conda-forge

$ conda list tornado
# packages in environment at /glade/work/abanihi/softwares/miniconda3/envs/analysis:
#
# Name                    Version                   Build  Channel
tornado                   5.1.1           py37h14c3975_1000    conda-forge
$ conda list distributed
# packages in environment at /glade/work/abanihi/softwares/miniconda3/envs/analysis:
#
# Name                    Version                   Build  Channel
distributed               1.27.0                   py37_0    conda-forge

Is anyone aware of a change in a recent dask or distributed release that could have caused dask-mpi to break?

Cc'ing @kmpaul

@kmpaul (Collaborator) commented May 2, 2019

The last CircleCI tests ran with dask=1.1.0 and distributed=1.25.2. I've tried to reproduce that same environment on my laptop, and it fails there. Yet rerunning the CircleCI test itself worked fine.

@bocklund commented Jun 8, 2019

I can reproduce this on macOS with the environments listed below. I am running:

mpirun dask-mpi --scheduler-file my_scheduler.json --nthreads 1
python -c "from distributed import Client; c = Client(scheduler_file='my_scheduler.json')"

I see this issue with:

dask                      1.2.0                      py_0    conda-forge
dask-core                 1.2.0                      py_0    conda-forge
dask-mpi                  1.0.2                    py36_0    conda-forge
distributed               1.28.1                   py36_0    conda-forge
tornado                   6.0.2            py36h01d97ff_0    conda-forge

and (downgraded dask)

dask                      1.1.5                      py_0    conda-forge
dask-core                 1.1.5                      py_0    conda-forge
dask-mpi                  1.0.2                    py36_0    conda-forge
distributed               1.28.1                   py36_0    conda-forge
tornado                   6.0.2            py36h01d97ff_0    conda-forge

and (downgraded distributed to 1.27.1)

dask                      1.2.0                      py_0    conda-forge
dask-core                 1.2.0                      py_0    conda-forge
dask-mpi                  1.0.2                    py36_0    conda-forge
distributed               1.27.1                   py36_0    conda-forge
tornado                   6.0.2            py36h01d97ff_0    conda-forge

and (dask 1.1.5 with distributed 1.26.1)

dask                      1.1.5                      py_0    conda-forge
dask-core                 1.1.5                      py_0    conda-forge
dask-mpi                  1.0.2                    py36_0    conda-forge
distributed               1.26.1                   py36_0    conda-forge
tornado                   6.0.2            py36h01d97ff_0    conda-forge

and (dask 1.1.1 with distributed 1.25.3)

dask                      1.1.1                      py_0    conda-forge
dask-core                 1.1.1                      py_0    conda-forge
dask-mpi                  1.0.2                    py36_0    conda-forge
distributed               1.25.3                   py36_0    conda-forge
tornado                   6.0.2            py36h01d97ff_0    conda-forge

However, the following works!

dask                      0.20.2                     py_0    conda-forge
dask-core                 0.20.2                     py_0    conda-forge
dask-mpi                  1.0.2                    py36_0    conda-forge
distributed               1.24.2                py36_1000    conda-forge
tornado                   6.0.2            py36h01d97ff_0    conda-forge

Downgrading distributed below 1.25 (to 1.24) and dask below 1.0 (to 0.20) seems to work. Since the two are coupled, I'm not sure which one the issue is in, but it's clearly upstream of dask-mpi.

@Timshel commented Jun 11, 2019

I had the same timeout problem. I was able to run my job by using dask-scheduler instead of dask-mpi to create the scheduler.

After some searching, it appears that the main difference in the dask-scheduler CLI is that it uses the current tornado IOLoop: https://github.com/dask/distributed/blob/1.28.1/distributed/cli/dask_scheduler.py#L197

Using the current loop instead of a new instance here: https://github.com/dask/dask-mpi/blob/master/dask_mpi/core.py#L50 makes it run.
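
Roughly, the change I am describing looks like this (just a sketch of the idea; the exact code in dask_mpi/core.py may differ):

# dask_mpi/core.py, inside initialize() -- a minimal sketch of the change above
from tornado.ioloop import IOLoop

# before: a brand-new event loop was created for the scheduler and workers
# loop = IOLoop()

# after: reuse the running thread's loop, as the dask-scheduler CLI does
loop = IOLoop.current()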

For reference, I'm running:

dask==1.2.2
dask-mpi==1.0.2
distributed==1.28.1
tornado==6.0.2

Edit: fixed the wrong link; the change I meant is in initialize().

@andersy005 (Member, Author)

@Timshel, @bocklund, thank you for chiming in. I am going to take a stab at a fix.

Moving forward, we may need to extend our testing environment to test different combinations of dask and distributed versions (or at least make sure that everything works with the latest versions).
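
For example, a periodic smoke test along these lines could catch this kind of breakage (purely a hypothetical sketch, not our actual CI configuration; it assumes conda >= 4.6 for conda run and reuses the dask_mpi_test.py reproducer from the top of this issue):

import subprocess

# pinned (dask, distributed) pairs to exercise; the last pair is the
# combination reported working earlier in this thread
PAIRS = [
    ("1.2.0", "1.28.1"),
    ("1.1.5", "1.26.1"),
    ("0.20.2", "1.24.2"),
]

for dask_ver, dist_ver in PAIRS:
    env = "dask-mpi-smoke"
    # build a throwaway environment with the pinned versions
    subprocess.run(
        ["conda", "create", "-y", "-n", env, "-c", "conda-forge", "python=3.7",
         "dask-mpi", "dask=" + dask_ver, "distributed=" + dist_ver],
        check=True,
    )
    try:
        # run the MPI reproducer; a hang here is exactly the bug in this issue
        proc = subprocess.run(
            ["conda", "run", "-n", env, "mpirun", "-np", "4",
             "python", "dask_mpi_test.py"],
            timeout=300,
        )
        status = "OK" if proc.returncode == 0 else "FAILED"
    except subprocess.TimeoutExpired:
        status = "TIMED OUT"
    print("dask=" + dask_ver + " distributed=" + dist_ver + " -> " + status)
    # clean up before the next combination
    subprocess.run(["conda", "env", "remove", "-y", "-n", env], check=True)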

@basnijholt

I am getting the same problem with the latest versions of dask and distributed, running the example from the docs.

This is blocking basnijholt/adaptive-scheduler#11.

@kmpaul (Collaborator) commented Jun 13, 2019

Fixed with #33. Thank you @Timshel for the tip.

@kmpaul closed this as completed on Jun 13, 2019.