Add draft of dask-on-hpc post #6
Conversation
Just a few comments on the current draft.
_posts/2018-12-29-dask-on-hpc.md
Outdated
Dask is compatible with scientific data formats like HDF5, NetCDF, Parquet, and
so on. This is because Dask works with other libraries within the Python
ecosystem, like XArray, which already has strong support for scientific data
nit: XArray-->xarray (or Xarray)
_posts/2018-12-29-dask-on-hpc.md
Outdated
completely different way of providing computing infrastructure to our users.
Dask opens up the door to cloud computing technologies (such as elastic scaling
and object storage) and makes us rethink what an HPC center should really look
like!
I'm not sure this is the correct place, but I suggest adding a few simple code examples to highlight the usage pattern of the cluster manager concept.
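For instance, something along these lines with dask-jobqueue (a minimal sketch; the queue name and resource numbers are placeholders, not values from the post):

```python
from dask.distributed import Client
from dask_jobqueue import PBSCluster

# One "job" is one PBS allocation hosting Dask workers
cluster = PBSCluster(
    cores=36,            # cores per job
    memory="100GB",      # memory per job
    queue="regular",     # placeholder queue name
    walltime="01:00:00",
)
cluster.scale(jobs=10)   # submit 10 jobs to the batch system

client = Client(cluster) # computations now run on those workers
```

The same pattern works for SLURMCluster, SGECluster, and the other job-queue managers.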
_posts/2018-12-29-dask-on-hpc.md
Outdated
Dask is free and open source, which means we do not have to rebalance our
budget and staff to address the new immediate need of data analysis tools.
There are no licenses, and we have the ability to make changes when necessary.
nit: there are licenses, we just don't have to pay for access to them.
_posts/2018-12-29-dask-on-hpc.md
Outdated
Dask provides a number of profiling tools that give you diagnostics at the
individual task-level, but there is no way today to analyze or profile your
Dask application at a coarse-grained level. Having more tools to analyze bulk
performance would be helpful when making design decisions.
One tangible idea here would be a benchmarking suite that helps users make decisions about how to use dask most effectively.
_posts/2018-12-29-dask-on-hpc.md
Outdated
---
{% include JB/setup %}

We analyze large datasets on HPC systems with Dask, a library for scalability
large datasets

"large datasets" is vague; can you be more quantitative (e.g. 1-100 TB)?
_posts/2018-12-29-dask-on-hpc.md
Outdated
# What needs work
At the moment, guidelines on adequate strategies and tutorials on prototypical calculations are limited, in my opinion, and could be one item in this list.
I understand the dashboard is there to help you decide on your own how to adapt your computations. From experience, this is not as simple as it seems, and multiple rounds of trial and error are the norm for large-scale calculations (rechunking O(10TB) of data is a typical computation in my group, for example; see the sketch below).
One issue that we typically face on HPC platforms is that the dataset does not fit, or only barely fits, in memory.
My experience tells me this has profound implications for the strategies you have to develop, compared to platforms where the dataset easily fits in memory.
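To make the rechunking point concrete, a toy sketch (the shape and chunk sizes here are illustrative and far below the 10TB scale):

```python
import dask.array as da

# Column-friendly chunks, as if the data were written one timestep at a time
x = da.zeros((100_000, 100_000), chunks=(100_000, 1_000))

# Rechunk to row-friendly chunks: one line to write, but the all-to-all
# data movement behind it is what forces trial and error at scale
y = x.rechunk((1_000, 100_000))
result = y.sum().compute()
```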
Some contributions.
_posts/2018-12-29-dask-on-hpc.md
Outdated
and works comfortably with native HPC hardware.

This article explains why this makes sense for us.
Our motivation is to share our experiences with our colleagues,
One of my motivations is to make HPC users aware of what Dask can offer them.
Helping them identify use cases could be nice too, but that's probably out of scope here.
_posts/2018-12-29-dask-on-hpc.md
Outdated
Dask extends libraries like Numpy, Pandas, and Scikit-learn, which are well
known APIs for scientists and engineers. It also extends simpler APIs for
multi-node multiprocessing. This makes it easy for our existing user base to
get up to speed.
Maybe we should compare to MPI/OpenMP and even Spark. Something like:
"Common solutions for distributed processing on HPC are job arrays or MPI-based frameworks. The first fits only simple use cases and tends to generate boilerplate bash scripts; the second is very powerful but complicated, and better suited to heavy C or Fortran numerical simulations. Dask easily covers the first and part of what MPI offers, and it is also easier and more complete than PySpark."
_posts/2018-12-29-dask-on-hpc.md
Outdated
All the infrastructure that we need is already in place.

Also, interactive analysis at scale and auto scaling is just awesome, and lets
us use our existing infrastructure in new ways and improves our occupancy
improve and optimize occupancy.
Adaptivity can prevent some users from booking compute nodes for too long.
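For reference, adaptivity is a one-liner (assuming a `cluster` object from one of the cluster managers):

```python
# Workers are requested when tasks queue up and released when they go idle,
# so compute nodes are not held longer than the computation needs them
cluster.adapt(minimum=0, maximum=100)
```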
_posts/2018-12-29-dask-on-hpc.md
Outdated
Dask is compatible with scientific data formats like HDF5, NetCDF, Parquet, and
so on. This is because Dask works with other libraries within the Python
ecosystem, like XArray, which already has strong support for scientific data
formats and processing.
As already mentioned (so I don't know if this has to be written again, but it is a strong argument in this bullet): Dask extends Numpy, which is at the heart of the scientific Python ecosystem.
_posts/2018-12-29-dask-on-hpc.md
Outdated
And yet Dask is not designed for any particular workflow, and instead can
provide infrastructure to cover a variety of different problems within an
institution. Everything is possible.
Maybe we should detail a bit here? I guess it depends on the final length of the post... (rough sketch after this list)
- You can easily handle Numpy arrays or Pandas Dataframes at scale, doing some numerical work or data analysis/cleaning.
- You can handle any object collection, say from JSON files or other sources, with Bag.
- You can express any workflow with Delayed, or parallelize anything with Delayed or Futures.
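A rough sketch of those entry points (the file patterns are hypothetical):

```python
import dask.array as da
import dask.bag as db
import dask.dataframe as dd
from dask import delayed

# Numpy-like arrays at scale
x = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))
mean = x.mean().compute()

# Pandas-like dataframes at scale (hypothetical CSV files)
df = dd.read_csv("data-*.csv")

# Arbitrary object collections, e.g. JSON lines, with Bag
records = db.read_text("logs-*.json")

# Custom workflows with Delayed
@delayed
def increment(a):
    return a + 1

result = increment(increment(1)).compute()
```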
_posts/2018-12-29-dask-on-hpc.md
Outdated
### 4. ~~Batch Launching~~

*This was resolved while we prepared this blogpost. Hooray for open source.*
I propose keeping this section (which is probably what is intended here). Just add a link to dask-mpi!
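For reference, the dask-mpi pattern looks roughly like this (rank layout per the dask-mpi docs; treat the details as assumptions):

```python
# Launch under MPI, e.g.:  mpirun -np 8 python my_script.py
from dask_mpi import initialize
from dask.distributed import Client

initialize()       # rank 0 becomes the scheduler, rank 1 runs this script,
client = Client()  # and the remaining ranks become workers
```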
_posts/2018-12-29-dask-on-hpc.md
Outdated
### 3. Scheduler Performance on Large Graphs
For this part, my impression is that you can wait several minutes for a computation to start when submitting 100,000 or millions of tasks. This can be costly when booking several nodes on HPC, but also in a cloud environment!
_posts/2018-12-29-dask-on-hpc.md
Outdated
### 5. More Data Formats

### 6. Task and submission history
Is this redundant with point 2?
Thanks all for the comments. If I can ask folks to go even further, would people be willing to directly edit this? I'm curious about what the best way is to modify this document as a group. As an experiment I've put things into hackmd.io here: https://hackmd.io/98wCLMyiTM-juP8S4FhTXA . It should be editable after sign-on. Would people be comfortable filling things in directly this way?
@jhamman and I have added a bit more content to the hackpad. If others have time to log in, check things out, and modify/add things, that would be very welcome.
I've added some ideas; this probably needs some rereading though.
Thanks @guillaumeeb ! I've also just gone through it again to smooth out some language. I would like to publish this soonish. @jhamman you had some suggestions on where we might place code samples. Would you be willing to add these? Also, who should we place on the authors list? I'm inclined to include everyone who participated on this issue if there are no objections to that.
Would it be worth adding a sense of priority in the section?
This looks quite good! Do we need more code samples? Some pictures? No great idea from me here... As for priority, I fear this will be too personal, but here's my list: more data formats (need good image format integration), large graphs, deep learning, heterogeneous resources, guidelines, diagnostics.
I understand priority may be personal, but some topics may stand out even after averaging over users. Also, the Dask dev team surely does not have enough resources to tackle everything at once. Is there input from the user community? If yes, how is this input collected? A poll could have been helpful to investigate this, and the post could have been an opportunity to show its results.
So @apatlpo I'm curious, what would your priority order be? And yes, we obviously can't tackle everything :). To answer your question from only my personal point of view (so I may be wrong here, and there are probably things I don't quite know yet about Dask governance):
- The input from the user community comes from issues on github and probably other places like Stack Overflow or conferences like SciPy.
- The Dask team tries to meet every month, but the discussions are really high-level decisions and actions.
- Development priorities are often opportunistic: core developers or other punctual contributors will work on what they need or feel is useful at the moment.
- We have a https://github.com/dask/governance project, but I don't think we have discussed how to propose a Roadmap or equivalent yet. Maybe there is already something that exists and I'm not aware of it. Maybe we should raise an issue there?

And we could propose a poll at the end of the post to see what the readers most want; that sounds like a good idea.
In practice I don't think that there is a priority order. Also, I don't think that a prioritization matters. It would matter if there was a dedicated team devoted to solving these issues. In practice people solve whatever problem is most important to their institution.
Should this post be published with the last modifications? I'm personally happy with how it looks!
Pinging this PR again: should we move forward here?
I'll add this to my TODO list for next week. That being said, I wouldn't be surprised if it slips off. I give this about a 50% chance of happening. If anyone else wants to manage this that would be welcome.
I've moved the hackpad text into this PR. I'll plan to publish early next week if there are no further comments. FYI @jhamman @willirath you both listed notes/TODOs in the hackpad. I suspect that you're too busy to fill them out, but I thought I'd send a reminder.
Thanks Matt, just one comment that may be inappropriate.
_posts/2018-12-29-dask-on-hpc.md
Outdated
---
{% include JB/setup %}

We analyze large datasets on HPC systems with Dask,
"large" is a bit vague; can we be more quantitative: 1 TB to 100 TB?
What I've seen ranges from hundreds of GB to 100 TB; I'm fine if you want to make it more precise.
_posts/2018-12-29-dask-on-hpc.md
Outdated
### 6. Cost and Collaboration

Dask is free and open source, which means we do not have to rebalance our budget and staff to address the new immediate need of data analysis tools.
We don't have to pay for licenses, and we have the ability to make changes to the code when necessary. The HPC community has good representation among Dask developers. It's easy for us to participate and our concerns are well understood.
It's easy for us to participate
I don't understand what you mean here
For me it means we can easily ask questions, raise issues, or propose changes.
_posts/2018-12-29-dask-on-hpc.md
Outdated
While Dask can theoretically handle this scale, it does tend to slow down a bit,
reducing the pleasure of interactive large-scale computing. Handling millions of tasks can lead to tens of seconds latency before a computation actually starts. This is perfectly fine for our Dask batch jobs, but tends to make the interactive Jupyter users frustrated.

Much of this slowdown is due to task-graph construction time and centralized scheduling, both of which can be accelerated through a variety of means. We expect that, with some cleverness, we can increase the scale at which Dask continues to run smoothly by another order of magnitude.
through a variety of means
This is a vague statement; wouldn't it be worth being a bit more specific?
Not sure this is the goal of this post. I think @mrocklin has some ideas on how Scheduler optimization could be done, but I'm not sure it's important to share them here.
_posts/2018-12-29-dask-on-hpc.md
Outdated
HPC users want to analyze Petabyte datasets on clusters of thousands of large nodes.

While Dask can theoretically handle this scale, it does tend to slow down a bit,
slow down a bit
I would consider this a strong understatement given my experience.
On several occasions, I found myself in situations where the scheduler could not handle the load, and I had to rewrite code to account for this (by manually processing subsets of my datasets).
Did you have memory problems on workers, or was it really the scheduler not responding?
I agree that this can be more than a slowdown from a user's perspective. I haven't pushed Dask to its task-count limit for several months though; maybe it has improved.
@guillaumeeb thanks for the recent comments. I was planning to publish this today. Did you want to make changes to address your concerns before I do so? I personally am unlikely to do anything else here.
Ah! My mistake.
I plan to merge this tomorrow morning. We can still modify it over time.
Published. Thanks for the work all.
Fixes #5
This is a work in progress. It includes some basic text copied from the issue (from responses by @guillaumeeb and @kmpaul) along with some filler from myself. I've currently listed everyone active on that issue as an author. Please let me know if you'd like to be added or removed.
Help filling things in here would be appreciated. People should be able to contribute text in the following ways: