
Add draft of dask-on-hpc post #6

Merged: 7 commits merged into gh-pages on Jun 12, 2019

Conversation

@mrocklin

Fixes #5

This is a work in progress. It includes some basic text copied from responses by @guillaumeeb and @kmpaul on that issue, along with some filler from me. I've currently listed everyone active on that issue as an author. Please let me know if you'd like to be added or removed.

Help filling things in here would be appreciated. People should be able to contribute text in the following ways:

  1. Direct commits / PRs (happy to give anyone commit access to my fork)
  2. Comments on this PR with suggested text

@jhamman left a comment

Just a few comments on the current draft.


Dask is compatible with scientific data formats like HDF5, NetCDF, Parquet, and
so on. This is because Dask works with other libraries within the Python
ecosystem, like XArray, which already has strong support for scientific data

nit: XArray-->xarray (or Xarray)

completely different way of providing computing infrastructure to our users.
Dask opens up the door to cloud computing technologies (such as elastic scaling
and object storage) and makes us rethink what an HPC center should really look
like!

I'm not sure this is the correct place, but I suggest adding a few simple code examples to highlight the usage pattern of the cluster managers concept.
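For example, a minimal sketch using dask-jobqueue (the queue name, resource values, and scaling numbers below are hypothetical; adapt them to the target system):

```python
# Hypothetical queue/resource values; dask-jobqueue also provides SLURMCluster,
# SGECluster, etc. for other job schedulers.
from dask_jobqueue import PBSCluster
from dask.distributed import Client

cluster = PBSCluster(queue="regular", cores=24, memory="100GB")
cluster.scale(10)                       # request 10 worker jobs from the queue
# cluster.adapt(minimum=1, maximum=20)  # or scale elastically with the workload

client = Client(cluster)                # Dask computations now run on those workers
```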


Dask is free and open source, which means we do not have to rebalance our
budget and staff to address the new immediate need of data analysis tools.
There are no licenses, and we have the ability to make changes when necessary.

nit: there are licenses, we just don't have to pay for access to them.

Dask provides a number of profiling tools that give you diagnostics at the
individual task-level, but there is no way today to analyze or profile your
Dask application at a coarse-grained level. Having more tools to analyze bulk
performance would be helpful when making design decisions.

One tangible idea here would be a benchmarking suite that helps users make decisions about how to use Dask most effectively.
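For context, the existing task-level diagnostics can already be driven from code; here is a minimal sketch using the local-scheduler profilers in dask.diagnostics (the array sizes are arbitrary, and the distributed scheduler offers a live dashboard instead):

```python
import dask.array as da
from dask.diagnostics import Profiler, ResourceProfiler, visualize

x = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))

with Profiler() as prof, ResourceProfiler(dt=0.25) as rprof:
    x.dot(x.T).mean().compute()

# prof.results holds per-task timings; rprof.results holds CPU/memory samples.
visualize([prof, rprof], show=False)  # writes an HTML report (requires bokeh)
```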

---
{% include JB/setup %}

We analyze large datasets on HPC systems with Dask, a library for scalability
@apatlpo Jan 3, 2019

large datasets is vague; can you be more quantitative (e.g. 1-100 TB)?



# What needs work


At the moment, guidelines for adequate strategies or tutorials on prototypical calculations are, in my opinion, limited, and they could be one item in this list.

I understand the dashboard is there to help you decide on your own how to adapt your computations. From experience, this is not as simple as it seems, and trial and error is the norm when performing large-scale calculations (rechunking O(10TB) of data is a typical computation in my group, for example).

One issue that we typically face on HPC platforms is that the dataset barely fits, or does not fit at all, in memory.
My experience tells me this has profound implications for the strategies you have to develop, compared to platforms where the dataset easily fits in memory.

@guillaumeeb left a comment

Some contributions.

and works comfortably with native HPC hardware.

This article explains why this makes sense for us.
Our motivation is to share our experiences with our colleagues,

One of my motivations is to make HPC users aware of what Dask can offer them.
Helping them identify use cases could be nice too, but that is probably out of scope here.

Dask extends libraries like Numpy, Pandas, and Scikit-learn, which are well
known APIs for scientists and engineers. It also extends simpler APIs for
multi-node multiprocessing. This makes it easy for our existing user base to
get up to speed.

Maybe we should compare to MPI/OpenMP and even Spark.

"Common solutions for distributed processing on HPC are job arrays or MPI-based frameworks. The first fits only simple use cases and tends to generate boilerplate bash scripts; the second is very powerful but complicated, and more suited to heavy C or Fortran numerical simulations. Dask easily covers the first and part of what MPI offers, and it is also easier and more complete than PySpark."

All the infrastructure that we need is already in place.

Also, interactive analysis at scale and auto scaling is just awesome, and lets
us use our existing infrastructure in new ways and improves our occupancy

improve and optimize occupancy.

Adaptivity can prevent some users from booking compute nodes for too long.

Dask is compatible with scientific data formats like HDF5, NetCDF, Parquet, and
so on. This is because Dask works with other libraries within the Python
ecosystem, like XArray, which already has strong support for scientific data
formats and processing.

As already mentioned (so I don't know if this has to be written again, but it is a strong argument in this bullet): Dask extends Numpy, which is at the heart of the scientific Python ecosystem.


And yet Dask is not designed for any particular workflow, and instead can
provide infrastructure to cover a variety of different problems within an
institution. Everything is possible.

Maybe we should detail a bit here? I guess it depends on the final length of the post...

  • You can easily handle Numpy arrays or Pandas DataFrames at scale, doing some numerical work or data analysis/cleaning.
  • You can handle any object collection, say from JSON files or any other source, with Bag.
  • You can express any workflow with Delayed, or parallelize anything with Delayed or Futures (see the sketch after this list).
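A minimal sketch of those three patterns (the file globs and paths below are hypothetical):

```python
import json
import dask.array as da
import dask.bag as db
from dask import delayed

# Numpy-like arrays at scale
x = da.random.random((20_000, 20_000), chunks=(2_000, 2_000))
col_means = x.mean(axis=0).compute()

# Arbitrary object collections, e.g. records from JSON-lines files, with Bag
records = db.read_text("data/*.json").map(json.loads)
n_valid = records.filter(lambda r: r.get("valid")).count().compute()

# Arbitrary workflows with Delayed
@delayed
def load(path):
    return open(path).read()

@delayed
def count_words(text):
    return len(text.split())

total = delayed(sum)([count_words(load(p)) for p in ["a.txt", "b.txt"]])
result = total.compute()
```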


### 4. ~~Batch Launching~~

*This was resolved while we prepared this blogpost. Hooray for open source.*

I propose to keep this section (which is probably what is intended here). Just add a link to dask-mpi!
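For reference, the dask-mpi pattern looks roughly like this (a minimal sketch; launch it with the site's MPI runner, e.g. `mpirun -np 8 python script.py`):

```python
from dask_mpi import initialize
from dask.distributed import Client

initialize()        # rank 0 becomes the scheduler, rank 1 runs this client code,
                    # and the remaining MPI ranks become Dask workers
client = Client()   # connects to the scheduler started by initialize()

futures = client.map(lambda n: n ** 2, range(100))
print(sum(client.gather(futures)))
```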


### 3. Scheduler Performance on Large Graphs



For this part: my impression is that you can wait several minutes for a computation to start when submitting 100,000 or millions of tasks. This can be costly when booking several nodes on HPC, but also in a cloud environment!

### 5. More Data Formats


### 6. Task and submission history

Is this redundant with point 2?

@mrocklin commented Jan 4, 2019

Thanks all for the comments. If I can ask folks to go even further, would people be willing to directly edit this? I'm curious about the best way to modify this document as a group. As an experiment I've put things into hackmd.io here: https://hackmd.io/98wCLMyiTM-juP8S4FhTXA . It should be editable after sign-on. Would people be comfortable filling things in directly this way?

@mrocklin

@jhamman and I have added a bit more content to the hackpad. If others have time to log in, check things out, and modify/add things that would be very welcome.

@guillaumeeb

I've added some ideas; this probably needs some rereading though.

@mrocklin commented Mar 6, 2019

Thanks @guillaumeeb !

I've also just gone through it again to smooth out some language. I would like to publish this soonish.

@jhamman you had some suggestions on where we might place code samples. Would you be willing to add these?

Also, who should we place on the authors list? I'm inclined to include everyone who participated on this issue if there are no objections to that.

@apatlpo commented Mar 7, 2019

Would it be worth adding a sense of priority in the "What needs work" section,
or does it depend too much on each person's point of view/interest?

@guillaumeeb

This looks quite good!

Do we need more code samples? Some pictures? No great idea from me here...

For the priority, I fear this will be too personal, but here's my list: more data formats (need good image format integration), large graphs, deep learning, heterogeneous resources, guidelines, diagnostics.

@apatlpo commented Mar 12, 2019

I understand priority may be personal, but some topics may stand out even after averaging over users.

Also, the Dask dev team surely does not have so many resources that they can tackle everything at once.
I'd be curious to know how the team actually selects development priorities.

Is there input from the user community? If yes, how is this input collected?

A poll could have been helpful to investigate this, and the post could have been an opportunity to show its results.

@guillaumeeb

So @apatlpo, I'm curious: what would your priority order be?

And yes, we obviously can't tackle everything :). To answer your question from only my personal point of view (so I may be wrong here, and there are probably things I don't quite know yet about Dask governance):

  • The input from the user community comes from issues on GitHub and probably other places like Stack Overflow or conferences like SciPy.
  • The Dask team tries to meet every month, but the discussions are about really high-level decisions and actions.
  • Development priorities are often opportunistic: core developers or other occasional contributors will work on what they need or feel is useful at the moment.
  • We have a https://github.com/dask/governance project, but I don't think we have discussed how to propose a roadmap or equivalent yet. Maybe there is already something that exists and I'm not aware of. Maybe we should raise an issue there?

And proposing a poll at the end of the post to see what readers most want sounds like a good idea.

@mrocklin commented Mar 15, 2019 via email

@guillaumeeb

Should this post be published with the latest modifications? I'm personally happy with how it looks!

@guillaumeeb

Pinging this PR again: should we move forward here?

@mrocklin

I'll add this to my TODO list for next week. That being said, I wouldn't be surprised if it slips off. I give this about a 50% chance of happening. If anyone else wants to manage this that would be welcome.

@mrocklin commented Jun 6, 2019

I've moved the hackpad text into this PR. I'll plan to publish early next week if there are no further comments.

FYI @jhamman @willirath you both listed notes/TODOs in the hackpad. I suspect that you're too busy to fill them out, but I thought I'd send a reminder.

@guillaumeeb left a comment

Thanks Matt, just one possibly inappropriate comment.

_posts/2018-12-29-dask-on-hpc.md (outdated)
---
{% include JB/setup %}

We analyze large datasets on HPC systems with Dask,

large is a bit vague; can we be more quantitative: 1 TB to 100 TB?


What I've seen is from hundreds of GB to 100 TB; I'm OK if you want to make it more precise.

### 6. Cost and Collaboration

Dask is free and open source, which means we do not have to rebalance our budget and staff to address the new immediate need of data analysis tools.
We don't have to pay for licenses, and we have the ability to make changes to the code when necessary. The HPC community has good representation among Dask developers. It's easy for us to participate and our concerns are well understood.

"It's easy for us to participate": I don't understand what you mean here.


For me it means we can easily ask questions, raise issues, or propose changes.

While Dask can theoretically handle this scale, it does tend to slow down a bit,
reducing the pleasure of interactive large-scale computing. Handling millions of tasks can lead to tens of seconds latency before a computation actually starts. This is perfectly fine for our Dask batch jobs, but tends to make the interactive Jupyter users frustrated.

Much of this slowdown is due to task-graph construction time and centralized scheduling, both of which can be accelerated through a variety of means. We expect that, with some cleverness, we can increase the scale at which Dask continues to run smoothly by another order of magnitude.

"through a variety of means" is a vague statement. Wouldn't it be worth being a bit more specific?


Not sure this is the goal of this post. I think @mrocklin has some ideas on how Scheduler optimization could be done, but I'm not sure it's important to share them here.


HPC users want to analyze Petabyte datasets on clusters of thousands of large nodes.

While Dask can theoretically handle this scale, it does tend to slow down a bit,

"slow down a bit"
I would consider this a strong understatement given my experience.
On several occasions, I found myself in situations where the scheduler could not handle the load and I had to rewrite code to account for this (by manually processing subsets of my datasets).


Did you have memory problems on workers, or was it really the scheduler not responding?

I agree that this can be more than a slowdown from a user's perspective. I haven't pushed Dask to its task-count limit for several months though; maybe it has improved.

@mrocklin

@guillaumeeb thanks for the recent comments. I was planning to publish this today. Did you want to make changes to address your concerns before I do so? I personally am unlikely to do anything else here.

@guillaumeeb

@mrocklin I think you meant to talk to @apatlpo, my comment was not really useful. Aurélien, could you propose some changes?

@mrocklin

I think you meant to talk to @apatlpo

Ah! My mistake.

@mrocklin

I plan to merge this tomorrow morning. We can still modify it over time.

@mrocklin merged commit d817090 into gh-pages on Jun 12, 2019
@mrocklin deleted the dask-on-hpc branch on June 12, 2019 05:09
@mrocklin

Published. Thanks for the work all.
