-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
parallelize window function evaluations #546
Conversation
Codecov Report
@@ Coverage Diff @@
## master #546 +/- ##
==========================================
- Coverage 76.13% 76.13% -0.01%
==========================================
Files 156 156
Lines 27032 27024 -8
==========================================
- Hits 20582 20574 -8
Misses 6450 6450
Continue to review full report at Codecov.
|
maybe a better way is to use rayon |
In my opinion we should try to stay away from Rayon and probably also should stay away from introducing parallelism at a low level (per expression) as it will likely add quite some overhead (in terms of extra threads, memory allocations hold on to and Rayon scheduling overhead). Or we should have some convincing benchmarks / reasoning why at a certain part it is reasonable to introduce parallelism using tool x. I think for other parts of the code we could use Tokios On the general level I think we mostly should check whether we perform parallelism at a partition / file level and introduce enough parallelism in the query plan (e.g. by using partitioning https://medium.com/@danilheres/increasing-the-level-of-parallelism-in-datafusion-4-0-d2a15b5a2093 ). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this approach is fine for now -- using something like rayon would just add more thread contention (as it has its own thread pool that is separate from the tokio thread pool). I think this approach is 👍
As mentioned elsewhere as we get more sophisticated in DataFusion, adding a task scheduler to manage these tasks (e.g. so we aren't thrashing, for example) is probably a good thing
Just to be super clear, I am not suggesting we add a task scheduler as part of adding window functions -- I was trying to say that I felt following the existing pattern of using |
1c4938a
to
7b5445f
Compare
7b5445f
to
59894d9
Compare
@jimexist FYI just give the word when you think this one is ready to review / merge (it is still marked as Draft so I am not sure) |
benchmark run:
i suspect that this requires more sample where multiple window functions are planned in the same phase. |
@Dandandan and @alamb what do you think of the above results? I think we can either merge or close this - and wait for that customized scheuduler |
In my view, creating tasks this small is unlikely to be of benefit for query execution. Note that spawn blocking will create new threads on demand (512 by default), which will lead to higher memory use, and additional CPU usage / context switching (which can hurt performance of other parts of the query execution). The spawn blocking is only meant for longer running tasks. I am not sure the customized scheduler will handle tasks this small. I think much larger gains at this moment can be achieved with more efficient implementations and parallization at a much higher level. |
I agree with @Dandandan 's conclusion |
thanks for the comment, i agree with the assessment above, plus that in most cases the number of window functions within a logically planned phase should be 1 or 2, there's little point to parallize. i think #569 is more promising |
59894d9
to
ed55aa2
Compare
Which issue does this PR close?
Closes #.
Rationale for this change
parallelize window function evaluations using tokio
What changes are included in this PR?
Are there any user-facing changes?