Should we be testing against multiple dask schedulers? #1971

Closed
jhamman opened this issue Mar 7, 2018 · 5 comments

@jhamman
Member

jhamman commented Mar 7, 2018

Almost all of our unit tests run against dask's default scheduler (usually dask.threaded). While it is true that the beauty of dask is that one can separate the scheduler from the logical implementation, there are a few idiosyncrasies to consider, particularly in xarray's backends. To that end, we have a few tests covering the integration of the distributed scheduler with xarray's backends, but the test coverage is not particularly complete.

If nothing else, I think it is worth considering tests that use the threaded, multiprocessing, and distributed schedulers for a larger subset of the backends tests (those that use dask).
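A minimal sketch of what such a parametrized backend test might look like (this uses the newer dask.config.set(scheduler=...) API rather than dask.set_options; the dataset, file name, and pytest tmp_path usage are illustrative rather than xarray's actual test layout, and a netCDF backend is assumed to be installed):

import dask
import pytest
import xarray as xr


@pytest.mark.parametrize("scheduler", ["synchronous", "threads", "processes"])
def test_roundtrip_netcdf(tmp_path, scheduler):
    # Build a small chunked dataset and round-trip it through a backend
    # under each scheduler.
    ds = xr.Dataset({"x": ("t", list(range(10)))}).chunk({"t": 5})
    path = str(tmp_path / "test.nc")
    with dask.config.set(scheduler=scheduler):
        ds.to_netcdf(path)
        with xr.open_dataset(path, chunks={"t": 5}) as actual:
            xr.testing.assert_identical(ds.compute(), actual.compute())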

Note, I'm bringing this up because I'm seeing some failing tests in #1793 that are unrelated to my code change but do appear to be related to dask and possibly a different default scheduler (example failure).

@shoyer
Member

shoyer commented Mar 7, 2018

Huh, that's interesting. Yes, I suppose we should at least consider parametrized tests using both dask's multithreaded and distributed schedulers. Though I'll note that for tests we actually set the default scheduler to dask's basic non-parallelized get, for easier debugging:

dask.set_options(get=dask.get)

For #1793, the key thing would be to ensure that we run the tests in an isolated context without changing the default scheduler.
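One way to keep that isolation is to scope the scheduler override with a context manager rather than setting it globally. A sketch, using the newer dask.config API (at the time of this thread the equivalent was dask.set_options used as a context manager):

import dask
import dask.array as da

# Scope the scheduler override to this block so the process-wide default
# is left untouched once the block exits.
with dask.config.set(scheduler="synchronous"):
    x = da.ones((1000, 1000), chunks=(100, 100))
    assert float(x.sum().compute()) == 1_000_000.0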

@jhamman
Member Author

jhamman commented Mar 8, 2018

I managed to dig up some more information here. I was having a test failure in test_serializable_locks, resulting in a traceback that looks like:

...
            timeout_handle = self.add_timeout(self.time() + timeout, self.stop)
        self.start()
        if timeout is not None:
            self.remove_timeout(timeout_handle)
        if not future_cell[0].done():
>           raise TimeoutError('Operation timed out after %s seconds' % timeout)
E           tornado.ioloop.TimeoutError: Operation timed out after 10 seconds

../../../anaconda/envs/xarray36/lib/python3.6/site-packages/tornado/ioloop.py:457: TimeoutError

From then on, the distributed scheduler remained in use, and any tests that used dask resulted in an additional timeout (or a similar error).

Unfortunately, my attempts to produce an MCVE have come up short. If I can come up with one, I'll report it upstream, but as it is, I can't reproduce this behavior outside of my example.

cc @mrocklin

@mrocklin
Contributor

mrocklin commented Mar 8, 2018

FWIW, most of the logic within the dask collections (array, dataframe, delayed) is only tested with dask.local.get_sync. This also makes the test suite much faster.

Obviously, though, for things like writing to disk it's useful to check different schedulers.
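For reference, pinning a single computation to the synchronous scheduler looks roughly like this in current dask (scheduler="synchronous" corresponds to the single-threaded code path exposed as dask.local.get_sync; the example itself is purely illustrative):

import dask.array as da

# The synchronous scheduler runs everything in the calling thread, so
# exceptions surface as ordinary tracebacks and pdb/profiling work,
# which is why it is convenient for test suites.
x = da.arange(10, chunks=5)
assert int(x.sum().compute(scheduler="synchronous")) == 45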

@Karel-van-de-Plassche
Contributor

Karel-van-de-Plassche commented May 28, 2018

It seems like the distributed scheduler is the recommended one to use in general, so maybe some tests could be added for it. For disk IO in particular, it would be interesting to see the difference.

http://dask.pydata.org/en/latest/setup.html

Note that the newer dask.distributed scheduler is often preferable even on single workstations. It contains many diagnostics and features not found in the older single-machine scheduler. The following pages explain in more detail how to set up Dask on a variety of local and distributed hardware.
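A distributed-scheduler test might look something like the sketch below; while a distributed.Client is active it becomes the default scheduler, so no explicit configuration is needed. The test name and dataset are illustrative, not xarray's actual fixtures, and a netCDF backend is assumed:

import xarray as xr
from dask.distributed import Client


def test_roundtrip_with_distributed(tmp_path):
    # An in-process (threads-only) client; while it is active, dask and
    # xarray computations default to the distributed scheduler.
    with Client(processes=False):
        ds = xr.Dataset({"x": ("t", list(range(10)))}).chunk({"t": 5})
        path = str(tmp_path / "distributed.nc")
        ds.to_netcdf(path)
        with xr.open_dataset(path, chunks={"t": 5}) as actual:
            xr.testing.assert_identical(ds.compute(), actual.compute())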

@jhamman
Member Author

jhamman commented Jan 13, 2019

Closing this now. The distributed integration test module seems to be covering our IO use cases well enough. I don't think we need to do anything here at this time.

jhamman closed this as completed Jan 13, 2019