
Dask on HPC, what works and what doesn't #5

Closed
mrocklin opened this issue Dec 5, 2018 · 14 comments · Fixed by #6

Comments

@mrocklin commented Dec 5, 2018

Hi All,

I'd like for a group of us to write a blogpost about using Dask on supercomputers, including why we like it today, and highlighting improvements that could be made in the near future to improve usability. My goal for this post is to show it around to various HPC groups, and to show it to my employer to motivate work in this area. I think that now is a good time for this community to have some impact by sharing its recent experience.

cc'ing some notable users today @guillaumeeb @jhamman @kmpaul @lesteve @dharhas @josephhardinee @jakirkham

To start the conversation, if we were to structure the post as five reasons we use Dask on HPC and five things that could be better, what would those five things be? I think it'd be good to get a five-item list from a few of the people cc'ed above, then maybe we talk about those lists and I (or anyone else who's interested) compose an initial draft that we can then all iterate on?

@kmpaul commented Dec 6, 2018

@jhamman can probably amend or append to this...

5 Reasons We Use Dask at NCAR

In my mind, these are the 5 best things that we get from Dask here at NCAR:

  • Cost: Dask is free and open source, which means we at NCAR do not have to rebalance our budget and staff to address the immediate need for data analysis tools that we have failed to invest in over the last decade as Big Data problems have reared their ugly heads.
  • Scalability: Dask provides the scalability we need to deal with the massively large datasets that our existing (primarily serial) data analysis tools cannot.
  • Simplicity: By abstracting the parallelism away from the user/developer, our analysis tools can be written by computer science non-experts, such as the scientists themselves, meaning that our software engineers can take on a more supporting role than a leadership role.
  • Inspiration: Dask is more than just a tool to NCAR; it is a gateway to thinking about a completely different way of providing computing infrastructure to our users. Dask opens up the door to cloud computing technologies (such as elastic scaling and object storage) and makes us rethink what an HPC center should really look like!
  • Collaboration: Dask, and members of the Dask development team, have been critical collaborators with NCAR staff, helping us learn a new way of conducting business in the open source software world.

3 Things That Could Be Better

I probably need some help developing this list, as it is short and very self-centered. In fact, some of these things may already have solutions... which I would be happy to hear about.

  • Batch Launching: For many at NCAR, launching a parallel job means two things: (1) writing a PBS or SLURM batch script that (2) runs your parallel application with mpirun or mpiexec (the conventional pattern is sketched just after this list). While Dask is fantastic for interactive jobs with Jupyter Lab/Notebooks, many of our workflows at NCAR are still (and probably will be for the foreseeable future) chained batch-job workflows. To fit into those workflows without extra modifications, a single batch job must be submitted that launches the Scheduler, the Workers, and the Client (and application, obviously) together.
  • Coarse-Grained Diagnostics: Dask provides a number of profiling tools that give you diagnostics at the individual task-level, but (as far as I know), there is no way of analyzing or profiling your Dask application at the (for lack of a better term) "unchunked" task-level (or the "operator-level", if that makes sense). The best way I know of getting coarse-grained diagnostics is to scale your application down to a much smaller size so that you can run it serially. However, some applications just don't show problems, or at least do not indicate the truly problematic step, when scaled back.
  • Scheduler Profiling: When large graphs get created, there can be long delays before any sign of execution shows up on the dashboard. I can only assume that this is because of work that the Scheduler is doing that is not being displayed. It would be nice to get some information about what the Scheduler is doing at these moments. Can this be done today?
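
To make the Batch Launching point concrete, the conventional pattern looks roughly like the following (a sketch only; the job name, queue, resource request, and application are all made up):

    #!/bin/bash
    # (job name, queue, resource request, and walltime below are all made up)
    #PBS -N my-analysis
    #PBS -q regular
    #PBS -l select=4:ncpus=36
    #PBS -l walltime=01:00:00

    # One batch job, one mpirun, one application: this is the shape that a
    # batch-friendly Dask launch needs to fit into.
    mpirun -np 144 ./my_parallel_application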

...@jhamman, is there anything you would add?

@mrocklin commented Dec 6, 2018

Thanks for getting this started, @kmpaul!

@kmpaul commented Dec 6, 2018

Oh, and I was just reminded of another "Thing That Could Be Better"...

  • Non-Dashboard Diagnostics: Is there a way to get the same information displayed on the dashboard dumped to a file (or files) that can be processed after the job? That would be really useful (a sketch of the kind of thing I mean follows).
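
Something like the sketch below is roughly what I have in mind, assuming Client.get_task_stream does what I think it does; I have not verified this on our systems, and the connection details are made up:

    import json

    from dask.distributed import Client

    client = Client(scheduler_file='scheduler.json')   # made-up connection details

    # ... the analysis runs here ...

    # Dump the per-task records that back the dashboard's task stream plot
    # so they can be processed after the job finishes.
    records = client.get_task_stream()
    with open('task-stream.json', 'w') as f:
        json.dump(records, f, default=str)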

@guillaumeeb commented Dec 6, 2018

My contribution, just some quick thoughts:

Reasons we use Dask at CNES

  • Ease of use: an extension of NumPy and Pandas, with APIs well known to scientists and engineers, plus other simple APIs for multi-node Python multiprocessing.
  • Versatility of APIs: arrays, dataframes, bags, delayed, futures, everything is possible. More accessible, and with more possibilities, than Spark.
  • Smooth HPC integration with dask-jobqueue (and dask-mpi): no need for any boilerplate bash code with qsub; interactive analysis at scale and auto-scaling are just awesome (see the sketch after this list).
  • Infrastructure agnostic: compatible with a laptop, a server, an HPC cluster, and even cloud computing if needed; the environment may change with very little code adaptation, and scalability is great.
  • Aimed at scientific processing and compatible with several scientific data formats (thanks to Xarray and other libraries): NetCDF, Parquet, remote sensing incoming with RasterIO... Not only text analysis; integrated into the scientific Python stack.
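
To illustrate the dask-jobqueue point, the pattern on a PBS cluster looks roughly like this (a sketch; the queue name, resources, and walltime are made up):

    from dask_jobqueue import PBSCluster
    from dask.distributed import Client

    # Each job submitted by the cluster object runs Dask workers;
    # no hand-written qsub boilerplate is needed.
    cluster = PBSCluster(
        queue='regular',        # made-up queue name
        cores=24,               # cores per job
        memory='100GB',         # memory per job
        walltime='01:00:00',
    )

    cluster.scale(10)           # ask for ten workers up front
    # cluster.adapt(minimum=2, maximum=40)   # or let the cluster auto-scale with the load

    client = Client(cluster)    # work submitted through this client runs on PBS jobs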

Things that could be better

  • Heterogeneous resource handling: workers with low or high memory, workers with GPUs, all in the same Dask cluster, multiple node pools... See distributed#2118 (Cluster should support many worker types). A rough sketch of the existing worker-resources mechanism follows this list.
  • Scheduler in a job: separating the scheduler from the notebook/IPython console/user process. See distributed#2235 (Redesign Cluster Managers).
  • Link with Deep Learning: how to feed data retrieved and organized by Dask into a DL algorithm, the interface between Dask and TensorFlow/PyTorch, could we launch Distributed TensorFlow from Dask?
  • Task and submission history (maybe this is already possible?), one feature of Spark that Dask misses for post-run analysis.
  • Remote sensing data (JPEG2000, GeoTIFF) integration could be improved, and probably other scientific formats too (FITS...)? But this is not directly related to Dask.
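
On the heterogeneous resources point, the existing worker-resources mechanism gets part of the way there, roughly like this (a sketch; the scheduler address, resource label, and function are made up):

    from dask.distributed import Client

    # Workers are started with abstract resource labels, e.g.:
    #   dask-worker scheduler-address:8786 --resources "GPU=2"
    # while plain workers are started without the label.
    client = Client('scheduler-address:8786')   # made-up scheduler address


    def train_model(part):
        # stand-in for a GPU-hungry step
        return sum(part)


    # Tasks that declare a GPU requirement are only scheduled on workers
    # that advertise GPUs; everything else can run anywhere in the cluster.
    futures = [client.submit(train_model, part, resources={'GPU': 1})
               for part in [[1, 2], [3, 4]]]
    print(client.gather(futures))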

Also 👍 to Scheduler Profiling from @kmpaul, but 👎 on Batch Launching, which is already covered by Dask IMO.

@guillaumeeb

And I'm also very interested in contributing to the blog post.

There may also be ideas to take from @jhamman's post that you already relayed on the Dask blog: https://blog.dask.org/2018/10/08/Dask-Jobqueue.

@kmpaul commented Dec 6, 2018

@guillaumeeb I'd be interested to hear how you solve the Batch Launching problem. And, if you feel it is a solved problem, I'm obviously happy to take it off the list (and learn something myself!).

@guillaumeeb

Maybe I'm misunderstanding your point, but isn't dask-mpi just there for batch launching Dask applications? Happy to discuss this on Gitter so as not to pollute this issue.
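
For reference, the pattern I had in mind is roughly the following (a sketch; the rank count and script name are made up):

    # script.py -- launched inside a single batch job with something like:
    #   mpirun -np 36 python script.py
    # Rank 0 becomes the scheduler, rank 1 continues as the client code below,
    # and the remaining ranks become workers.
    from dask_mpi import initialize

    initialize()

    from dask.distributed import Client

    client = Client()   # connects to the scheduler started by initialize()

    # ... the rest of the application runs here, on the MPI-launched cluster ...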

@kmpaul commented Dec 6, 2018

Sure. Let's move to Gitter.

@guillaumeeb

@kmpaul just convinced me that dask-mpi is not yet sufficient to implement batch launching correctly, so in the end 👍 to improved batch launching too!

@mrocklin

I'm happy to start writing up a skeleton draft of this if people are interested.

Alternatively, it would still be good to get thoughts from others. I wonder if, now that AGU is over, folks like @rabernat or @jhamman have time. I'm also interested in thoughts from non-earth-scientists like @lesteve and @ogrisel, if they have time to list some general thoughts.

@guillaumeeb

ping also @willirath and @apatlpo.

@dharhas commented Dec 17, 2018

Apologies for the delayed response. @guillaumeeb's list is actually pretty spot on. Very interested in the heterogeneous resource handling and improved batch launching.

@willirath

Reasons we use Dask at GEOMAR

(In addition to virtually all of the above)

Easy to teach: When training people to use Dask, it's very easy to expose them to exactly the fraction of the API that is necessary for the task at hand.

Things that could be better

Heterogeneous clusters: Both making them easier to launch and having a simple way of associating built-in methods of Dask arrays with resources would be great.

@mrocklin

I've added a quick draft here: #6

Help filling things in there would be welcome.
