P2P shuffle questions #5939
There was an attempt at using a SchedulerPlugin for synchronization instead of relying on worker RPC calls. That was pretty cool and would offer an avenue to potentially fix other problems. However, at the same time it either required us to parse keys or introduce special annotations. In the end we chose the current approach for simplicity. All of the RPC stuff in the tasks could easily be moved to a plugin once we figure out how to filter/select the keys efficiently. See also #5524
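For illustration, a rough sketch of what that plugin approach could look like (the class name and the string-prefix key filter below are my own placeholders, not code from that PR):

```python
# Hypothetical sketch of the SchedulerPlugin approach described above.
# The naive startswith() filter is exactly the key-parsing problem
# mentioned; special annotations would be the alternative.
from distributed.diagnostics.plugin import SchedulerPlugin

class ShuffleSyncPlugin(SchedulerPlugin):
    def transition(self, key, start, finish, *args, **kwargs):
        if finish == "memory" and str(key).startswith("shuffle-barrier"):
            # The scheduler, rather than RPC calls made from inside
            # tasks, would coordinate the next phase of the shuffle here.
            ...
```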
yup
nope
See #5524 and the extensive discussion there, including #5524 (comment) and the comment above for why we didn't end up merging that PR instead (even though in principle it worked better, and is the approach we'll need to switch to).
Yes (assuming by "sync" you mean "in a thread"); it is called right now within a task, in distributed/shuffle/shuffle_extension.py, lines 85 to 87 at 936fba5.
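For context, the call path looks roughly like this (a sketch only; the `"shuffle"` registration name and the `add_partition` method are assumptions, not the exact code at those lines):

```python
# Sketch: a task reaching the worker's shuffle extension synchronously,
# in the spirit of distributed/shuffle/shuffle_extension.py.
from distributed import get_worker

def shuffle_transfer(data, shuffle_id):
    ext = get_worker().extensions["shuffle"]   # assumed registration name
    ext.add_partition(data, shuffle_id)        # runs in the task's thread
```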
```
In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: df = pd.DataFrame(np.random.randint(1, 1000, 100_000))

In [4]: %load_ext ptime

In [5]: %ptime -n 2 list(df.groupby(0))
Total serial time:   0.07 s
Total parallel time: 0.06 s
For a 1.19X speedup across 2 threads

In [6]: %ptime -n 4 list(df.groupby(0))
Total serial time:   0.14 s
Total parallel time: 0.16 s
For a 0.86X speedup across 4 threads

In [7]: %ptime -n 8 list(df.groupby(0))
Total serial time:   0.29 s
Total parallel time: 0.26 s
For a 1.14X speedup across 8 threads
```
dask/dask#8392 did that and we closed it. I'd object to adding it right now, since there's no reason why anyone should use it currently. It's slower than a task shuffle, extremely non-resilient, and is all in-memory. Once it's useful to an end user, then it should be added.
I think there are some other issues (I remember coming up with a situation in which our resilience strategy wouldn't work at some point), but I don't remember them off the top of my head.
Heads up, I've started looking into this problem again, hence the recent questions.
There is an in-between state where we … Also, I'm looking into using Arrow's sort_by/slice to do this outside of pandas.
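As a rough illustration of that Arrow idea (my own sketch; the `_partition` column is an assumption about how rows are tagged with their destination):

```python
# Sketch: split a table into shards with pyarrow sort_by/slice instead
# of pandas groupby. Sorting once by the destination partition lets us
# take zero-copy slices for each shard.
import numpy as np
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({
    "x": np.random.random(100_000),
    "_partition": np.random.randint(0, 8, 100_000),  # assumed tag column
})
table = pa.Table.from_pandas(df).sort_by([("_partition", "ascending")])

parts = table.column("_partition").to_numpy()
# Offsets where the partition id changes, plus the two ends.
offsets = np.concatenate([[0], np.flatnonzero(np.diff(parts)) + 1, [len(parts)]])
shards = [table.slice(start, stop - start)
          for start, stop in zip(offsets[:-1], offsets[1:])]
```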
Makes sense
Yup, that'll be interesting 🙂 I was chatting with Florian and he brought up network backpressure. My suggestion for that is ...
Howdy folks, I took a look through the recent P2P shuffle implementation. I had a few questions and thoughts. I suspect that most of this is for @gjoseph92 and @fjetter.
Just making sure I understand things
So we've taken the structure in the POC and ...
Is this correct? Am I missing other large changes?
Some random thoughts
These are just thoughts for future improvements as we evolve. I'm looking for "yup, that's maybe a good idea" or "nope, that's probably dumb because of x, y, z".
"p2p"
to Dask as an option there?Future efforts
I see two major efforts:
Assuming that we had magical solutions to both of those problems, what else is missing?