
Add draft of dask-on-hpc post #6

Merged: 7 commits merged into gh-pages on Jun 12, 2019

Conversation

@mrocklin

Fixes #5

This is a work in progress. It includes some basic text copied from responses by @guillaumeeb and @kmpaul on that issue, along with some filler from me. I've currently listed everyone active on that issue as an author. Please let me know if you'd like to be added or removed.

Help filling things in here would be appreciated. People should be able to contribute text in the following ways:

  1. Direct commits / PRs (happy to give anyone commit access to my fork)
  2. Comments on this PR with suggested text

@jhamman left a comment

Just a few comments on the current draft.


Dask is compatible with scientific data formats like HDF5, NetCDF, Parquet, and
so on. This is because Dask works with other libraries within the Python
ecosystem, like XArray, which already has strong support for scientific data

nit: XArray-->xarray (or Xarray)

completely different way of providing computing infrastructure to our users.
Dask opens up the door to cloud computing technologies (such as elastic scaling
and object storage) and makes us rethink what an HPC center should really look
like!

I'm not sure this is the correct place, but I suggest adding a few simple code examples to highlight the usage pattern of the cluster managers concept.
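For example, a minimal sketch using dask-jobqueue (the queue name, resource values, and scaling numbers below are hypothetical; adapt them to the target system):

```python
# Hypothetical queue/resource values; dask-jobqueue also provides SLURMCluster,
# SGECluster, etc. for other job schedulers.
from dask_jobqueue import PBSCluster
from dask.distributed import Client

cluster = PBSCluster(queue="regular", cores=24, memory="100GB")
cluster.scale(10)                       # request 10 worker jobs from the queue
# cluster.adapt(minimum=1, maximum=20)  # or scale elastically with the workload

client = Client(cluster)                # Dask computations now run on those workers
```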


Dask is free and open source, which means we do not have to rebalance our
budget and staff to address the new immediate need of data analysis tools.
There are no licenses, and we have the ability to make changes when necessary.

nit: there are licenses, we just don't have to pay for access to them.

Dask provides a number of profiling tools that give you diagnostics at the
individual task-level, but there is no way today to analyze or profile your
Dask application at a coarse-grained level. Having more tools to analyze bulk
performance would be helpful when making design decisions.

One tangible idea here would be a benchmarking suite that helps users make decisions about how to use Dask most effectively.
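For context, the existing task-level diagnostics can already be driven from code; here is a minimal sketch using the local-scheduler profilers in dask.diagnostics (the array sizes are arbitrary, and the distributed scheduler offers a live dashboard instead):

```python
import dask.array as da
from dask.diagnostics import Profiler, ResourceProfiler, visualize

x = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))

with Profiler() as prof, ResourceProfiler(dt=0.25) as rprof:
    x.dot(x.T).mean().compute()

# prof.results holds per-task timings; rprof.results holds CPU/memory samples.
visualize([prof, rprof], show=False)  # writes an HTML report (requires bokeh)
```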

---
{% include JB/setup %}

We analyze large datasets on HPC systems with Dask, a library for scalability
@apatlpo Jan 3, 2019

large datasets is vague; can you be more quantitative (e.g. 1-100 TB)?



# What needs work


At the moment, guidelines for adequate strategies or tutorials on prototypical calculations are, in my opinion, limited, and they could be one item in this list.

I understand the dashboard is there to help you decide on your own how to adapt your computations. From experience, this is not as simple as it seems, and trial and error is the norm when performing large-scale calculations (rechunking O(10TB) of data is a typical computation in my group, for example).

One issue that we typically face on HPC platforms is that the dataset barely fits, or does not fit at all, in memory.
My experience tells me this has profound implications for the strategies you have to develop, compared to platforms where the dataset easily fits in memory.

@guillaumeeb left a comment

Some contributions.

and works comfortably with native HPC hardware.

This article explains why this makes sense for us.
Our motivation is to share our experiences with our colleagues,

One of my motivations is to make HPC users aware of what Dask can offer them.
Helping them identify use cases could be nice too, but that is probably out of scope here.

Dask extends libraries like Numpy, Pandas, and Scikit-learn, which are well
known APIs for scientists and engineers. It also extends simpler APIs for
multi-node multiprocessing. This makes it easy for our existing user base to
get up to speed.

Maybe we should compare to MPI/OpenMP and even Spark.

"Common solutions for distributed processing on HPC are job arrays or MPI-based frameworks. The first fits only simple use cases and tends to generate boilerplate bash scripts; the second is very powerful but complicated, and more suited to heavy C or Fortran numerical simulations. Dask easily covers the first and part of what MPI offers, and it is also easier and more complete than PySpark."

All the infrastructure that we need is already in place.

Also, interactive analysis at scale and auto scaling is just awesome, and lets
us use our existing infrastructure in new ways and improves our occupancy

improve and optimize occupancy.

Adaptivity can prevent some users from booking compute nodes for too long.

Dask is compatible with scientific data formats like HDF5, NetCDF, Parquet, and
so on. This is because Dask works with other libraries within the Python
ecosystem, like XArray, which already has strong support for scientific data
formats and processing.

As already mentioned (so I don't know if this has to be written again, but it is a strong argument in this bullet): Dask extends Numpy, which is at the heart of the scientific Python ecosystem.


And yet Dask is not designed for any particular workflow, and instead can
provide infrastructure to cover a variety of different problems within an
institution. Everything is possible.

Maybe we should detail a bit here? I guess it depends on the final length of the post...

  • You can easily handle Numpy arrays or Pandas DataFrames at scale, doing some numerical work or data analysis/cleaning.
  • You can handle any object collection, say from JSON files or any other source, with Bag.
  • You can express any workflow with Delayed, or parallelize anything with Delayed or Futures (see the sketch after this list).
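A minimal sketch of those three patterns (the file globs and paths below are hypothetical):

```python
import json
import dask.array as da
import dask.bag as db
from dask import delayed

# Numpy-like arrays at scale
x = da.random.random((20_000, 20_000), chunks=(2_000, 2_000))
col_means = x.mean(axis=0).compute()

# Arbitrary object collections, e.g. records from JSON-lines files, with Bag
records = db.read_text("data/*.json").map(json.loads)
n_valid = records.filter(lambda r: r.get("valid")).count().compute()

# Arbitrary workflows with Delayed
@delayed
def load(path):
    return open(path).read()

@delayed
def count_words(text):
    return len(text.split())

total = delayed(sum)([count_words(load(p)) for p in ["a.txt", "b.txt"]])
result = total.compute()
```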


### 4. ~~Batch Launching~~

*This was resolved while we prepared this blogpost. Hooray for open source.*

I propose to keep this section (which is probably what is intended here). Just add a link to dask-mpi!
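For reference, the dask-mpi pattern looks roughly like this (a minimal sketch; launch it with the site's MPI runner, e.g. `mpirun -np 8 python script.py`):

```python
from dask_mpi import initialize
from dask.distributed import Client

initialize()        # rank 0 becomes the scheduler, rank 1 runs this client code,
                    # and the remaining MPI ranks become Dask workers
client = Client()   # connects to the scheduler started by initialize()

futures = client.map(lambda n: n ** 2, range(100))
print(sum(client.gather(futures)))
```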


### 3. Scheduler Performance on Large Graphs



For this part: my impression is that you can wait several minutes for a computation to start when submitting 100,000 or millions of tasks. This can be costly when booking several nodes on HPC, but also in a cloud environment!

### 5. More Data Formats


### 6. Task and submission history

Is this redundant with point 2?

@mrocklin commented Jan 4, 2019

Thanks all for the comments. If I can ask folks to go even further, would people be willing to directly edit this? I'm curious about the best way to modify this document as a group. As an experiment I've put things into hackmd.io here: https://hackmd.io/98wCLMyiTM-juP8S4FhTXA . It should be editable after sign-on. Would people be comfortable filling things in directly this way?

@mrocklin

@jhamman and I have added a bit more content to the hackpad. If others have time to log in, check things out, and modify/add things that would be very welcome.

@guillaumeeb

I've added some ideas; this probably needs some rereading though.

@mrocklin commented Mar 6, 2019

Thanks @guillaumeeb !

I've also just gone through it again to smooth out some language. I would like to publish this soonish.

@jhamman you had some suggestions on where we might place code samples. Would you be willing to add these?

Also, who should we place on the authors list? I'm inclined to include everyone who participated on this issue if there are no objections to that.

@apatlpo commented Mar 7, 2019

Would it be worth adding a sense of priority in the "What needs work" section,
or does it depend too much on each person's point of view/interest?

@guillaumeeb

This looks quite good!

Do we need more code samples? Some pictures? No great idea from me here...

For the priority, I fear this will be too personal, but here's my list: more data formats (need good image format integration), large graphs, deep learning, heterogeneous resources, guidelines, diagnostics.

@apatlpo commented Mar 12, 2019

I understand priority may be personal, but some topics may stand out even after averaging over users.

Also, the Dask dev team surely does not have so many resources that they can tackle everything at once.
I'd be curious to know how the team actually selects development priorities.

Is there input from the user community? If yes, how is this input collected?

A poll could have been helpful to investigate this, and the post could have been an opportunity to show its results.

@guillaumeeb

So @apatlpo, I'm curious: what would your priority order be?

And yes, we obviously can't tackle everything :). To answer your question from only my personal point of view (so I may be wrong here, and there are probably things I don't quite know yet about Dask governance):

  • The input from the user community comes from issues on GitHub and probably other places like Stack Overflow or conferences like SciPy.
  • The Dask team tries to meet every month, but the discussions are about really high-level decisions and actions.
  • Development priorities are often opportunistic: core developers or other occasional contributors will work on what they need or feel is useful at the moment.
  • We have a https://github.com/dask/governance project, but I don't think we have discussed how to propose a roadmap or equivalent yet. Maybe there is already something that exists and I'm not aware of. Maybe we should raise an issue there?

And proposing a poll at the end of the post to see what readers most want sounds like a good idea.

@mrocklin commented Mar 15, 2019 via email

@guillaumeeb

Should this post be published with the latest modifications? I'm personally happy with how it looks!

@guillaumeeb

Pinging this PR again: should we move forward here?

@mrocklin

I'll add this to my TODO list for next week. That being said, I wouldn't be surprised if it slips off. I give this about a 50% chance of happening. If anyone else wants to manage this that would be welcome.

@mrocklin commented Jun 6, 2019

I've moved the hackpad text into this PR. I'll plan to publish early next week if there are no further comments.

FYI @jhamman @willirath you both listed notes/TODOs in the hackpad. I suspect that you're too busy to fill them out, but I thought I'd send a reminder.

@guillaumeeb left a comment

Thanks Matt, just one possibly inappropriate comment.

_posts/2018-12-29-dask-on-hpc.md (outdated)
---
{% include JB/setup %}

We analyze large datasets on HPC systems with Dask,

large is a bit vague; can we be more quantitative: 1 TB to 100 TB?


What I've seen is from hundreds of GB to 100 TB; I'm OK if you want to make it more precise.

### 6. Cost and Collaboration

Dask is free and open source, which means we do not have to rebalance our budget and staff to address the new immediate need of data analysis tools.
We don't have to pay for licenses, and we have the ability to make changes to the code when necessary. The HPC community has good representation among Dask developers. It's easy for us to participate and our concerns are well understood.

"It's easy for us to participate": I don't understand what you mean here.


For me it means we can easily ask questions, raise issues, or propose changes.

While Dask can theoretically handle this scale, it does tend to slow down a bit,
reducing the pleasure of interactive large-scale computing. Handling millions of tasks can lead to tens of seconds latency before a computation actually starts. This is perfectly fine for our Dask batch jobs, but tends to make the interactive Jupyter users frustrated.

Much of this slowdown is due to task-graph construction time and centralized scheduling, both of which can be accelerated through a variety of means. We expect that, with some cleverness, we can increase the scale at which Dask continues to run smoothly by another order of magnitude.

"through a variety of means" is a vague statement. Wouldn't it be worth being a bit more specific?


Not sure this is the goal of this post. I think @mrocklin has some ideas on how Scheduler optimization could be done, but I'm not sure it's important to share them here.


HPC users want to analyze Petabyte datasets on clusters of thousands of large nodes.

While Dask can theoretically handle this scale, it does tend to slow down a bit,

"slow down a bit"
I would consider this a strong understatement given my experience.
On several occasions, I found myself in situations where the scheduler could not handle the load and I had to rewrite code to account for this (by manually processing subsets of my datasets).


Did you have memory problems on workers, or was it really the scheduler not responding?

I agree that this can be more than a slowdown from a user's perspective. I haven't pushed Dask to its task-count limit for several months though; maybe it has improved.

@mrocklin

@guillaumeeb thanks for the recent comments. I was planning to publish this today. Did you want to make changes to address your concerns before I do so? I personally am unlikely to do anything else here.

@guillaumeeb

@mrocklin I think you meant to talk to @apatlpo, my comment was not really useful. Aurélien, could you propose some changes?

@mrocklin

I think you meant to talk to @apatlpo

Ah! My mistake.

@mrocklin

I plan to merge this tomorrow morning. We can still modify it over time.

@mrocklin merged commit d817090 into gh-pages on Jun 12, 2019
@mrocklin deleted the dask-on-hpc branch on June 12, 2019 05:09
@mrocklin

Published. Thanks for the work all.
