RFC: More Granular File Operators #2079
Comments
I like this idea (though I am somewhat biased, as I like / implemented a bunch of the IOx approach) :)
Thanks for bringing it up. Recently, we've encountered several different circumstances in which we need to deal with query execution parallelism:
I think it might be the right time to rethink how to divide query working sets, how to partition/repartition data among operators, and how we should schedule tasks. My opinion on the whole scheduler and execution framework (partly proposed in #1987 (comment) and in our roadmap) is:
By the method proposed above, we could also achieve:
with the help of the TaskScheduler, and achieve:
by an existing, sequentially executed Task.
I agree entirely that the current scheduling within DataFusion, which effectively punts onto tokio and hopes it does something vaguely appropriate, is likely sub-optimal. In fact one of the issues I'm having with #1617 appears to be that sub-optimal task scheduling is causing a 2x performance regression. That being said, I think there are a number of problems here and it would be advantageous in my mind to keep them separate:
This ticket is then concerned solely with the first of these: reducing the complexity of the file-format specific operators necessary to achieve this translation. In particular, removing the ability to scan multiple files within a single operator, and instead composing multiple per-file operators in the generated plan.

Does this make sense, or am I missing something? FWIW we sort of already do the stage-based execution you describe, but it is implicit, based on whether operators call
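For concreteness, here is a minimal sketch of the composition described in the preceding comment. This illustrates the proposal rather than DataFusion's actual API: ParquetExec::new_single_file is a hypothetical constructor (the real ParquetExec is built from a FileScanConfig spanning multiple file groups), and imports of the DataFusion types (ParquetExec, UnionExec, ExecutionPlan, PartitionedFile) are elided.

```rust
use std::sync::Arc;

// Sketch only: `new_single_file` is a hypothetical per-file constructor
// used to illustrate the idea; it is not part of DataFusion's current API.
fn compose_scan(files: Vec<PartitionedFile>) -> Arc<dyn ExecutionPlan> {
    // One single-partition operator per file...
    let per_file: Vec<Arc<dyn ExecutionPlan>> = files
        .into_iter()
        .map(|file| Arc::new(ParquetExec::new_single_file(file)) as Arc<dyn ExecutionPlan>)
        .collect();

    // ...composed by a UnionExec, which exposes each child as its own output
    // partition, replacing the file-group logic currently embedded inside
    // ParquetExec itself.
    Arc::new(UnionExec::new(per_file))
}
```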
Sounds great. We could eliminate much of the common logic scattered in different file formats, as you mentioned. And yes, this makes sense!
I'm confused here: How would we parallelize the physical operators after this
What do you think if we avoid
You would have multiple single-partition
Let me get back to you on this, it is something I am currently mulling over and experimenting with. I agree that using async for CPU-bound work seems a little wonky, but as @alamb articulated here, there are reasons it may be the pragmatic choice. I'm trying to collect some data so we can make an informed decision 😅

FWIW, as you link to the morsel-driven paper: what you describe is, I think, closer to the more traditional plan-driven parallelism than to morsel-driven parallelism. Tokio is much closer to that paper than what you describe, as it incorporates notions of dynamic scheduling and work-stealing; rayon may be even closer.
Very much looking forward to it.
I think work-stealing in morsel-driven execution and work-stealing in Tokio are quite different things. Taking a rough partition of the whole dataset at the beginning, and later stealing part of the data from a skewed partition for idle working slots or CPU cores, is quite different from Tokio's task/green-thread stealing. Or am I missing something crucial: can one SendableRecordBatchStream be processed in parallel by multiple tokio tasks? 🤔
It depends. The SendableRecordBatchStream itself can only be processed by a single tokio task, correct; however, there is nothing to prevent that stream from actually being an mpsc channel, with the actual work performed in other tasks in parallel. In fact this is exactly what CoalescePartitionsExec does, and the physical optimizer will add combinations of RepartitionExec and CoalescePartitionsExec to plans based on the target_partitions setting.

Whilst target_partitions is typically set to the CPU thread count, RepartitionExec will typically appear multiple times in a given plan, and so this will result in more tasks than CPU cores. If there are other queries running concurrently, or target_partitions is set higher, this will be even more pronounced. If you now squint, this is a first-order approximation of morsel-driven execution. It's far from perfect, the tokio scheduler is not in any way NUMA-aware and in fact it optimises for load distribution at the expense of thread-locality, but it is not hugely dissimilar. At least that's my hand-wavy argument 😆 I happen to think rayon is closer in spirit, but I'm not sure how much of a difference that will make in practice.
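As a self-contained illustration of that pattern (plain tokio rather than DataFusion code, assuming the tokio and tokio-stream crates): the consumer polls one stream, while the work is done by parallel producer tasks feeding an mpsc channel, which is essentially what CoalescePartitionsExec does with RecordBatches.

```rust
use tokio::sync::mpsc;
use tokio_stream::{wrappers::ReceiverStream, StreamExt};

#[tokio::main]
async fn main() {
    // A bounded channel: the single output stream pulls from it,
    // providing back-pressure to the producers.
    let (tx, rx) = mpsc::channel::<u64>(8);

    // One producer task per "partition"; the tokio runtime schedules these
    // across its worker threads, so the work happens in parallel.
    for partition in 0..4u64 {
        let tx = tx.clone();
        tokio::spawn(async move {
            for i in 0..3 {
                // Stand-in for producing a RecordBatch from one partition.
                tx.send(partition * 100 + i).await.unwrap();
            }
        });
    }
    drop(tx); // the channel closes once every producer task finishes

    // The single consumer-facing stream, analogous to the
    // SendableRecordBatchStream returned by CoalescePartitionsExec.
    let mut merged = ReceiverStream::new(rx);
    while let Some(value) = merged.next().await {
        println!("got {value}");
    }
}
```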
As promised, the scheduling proposal can be found in #2199
Ok I've created some sub-tasks that I think should provide a path towards implementing this proposal:
I'm currently somewhat over-subscribed getting #2199 over the line, so if someone is free and willing to help out, that would be awesome. If not, I'll get to them when I have a chance.
@tustvold thank you for driving this. I do hope to get around to helping on this, but I have been a bit short of time lately. Will keep you posted if I'm able to pick any of this up.
I think this work is now encapsulated in smaller sub-tasks, so closing this issue |
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Currently the file scan operators such as ParquetExec, CsvExec, etc. are created with a FileScanConfig, which internally contains a list of PartitionedFile. These PartitionedFile are provided grouped together in "file groups". For each of these groups, the operators expose a DataFusion partition which will scan these files sequentially.

Whilst this works, I'm getting a little bit concerned we are ending up with quite a lot of complexity within each of the individual, file-format specific operators:
This in turn comes with some downsides:
Describe the solution you'd like
It isn't a fully formed proposal, but I wonder if, instead of continuing to extend the individual file format operators, we might instead compose together simpler physical operators within the query plan. Specifically, I wonder if we might make it so that the ParquetExec and CsvExec operators handle a single file, and the plan stage within TableProvider::scan instead constructs a more complex physical plan, containing ProjectionExec, SchemaAdapter, etc. operators as necessary.

For what it is worth, IOx uses a similar approach https://github.com/influxdata/influxdb_iox/blob/main/query/src/provider.rs#L282 and it works quite well.
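To make the shape of such a TableProvider::scan concrete, here is a rough sketch under the same caveats as the earlier one: new_single_file and adapt_to_table_schema are hypothetical names standing in for a single-file scan constructor and for the ProjectionExec / SchemaAdapter wiring described above, and DataFusion imports are elided.

```rust
use std::sync::Arc;

// Illustrative sketch only, not DataFusion's actual API.
fn scan(table_schema: SchemaRef, files: Vec<PartitionedFile>) -> Arc<dyn ExecutionPlan> {
    let children: Vec<Arc<dyn ExecutionPlan>> = files
        .into_iter()
        .map(|file| {
            // A hypothetical single-file scan operator.
            let scan: Arc<dyn ExecutionPlan> = Arc::new(ParquetExec::new_single_file(file));
            // Map this file's columns to the table schema, e.g. by adding a
            // ProjectionExec that reorders columns and fills in missing ones;
            // `adapt_to_table_schema` is a hypothetical helper for that step.
            adapt_to_table_schema(scan, &table_schema)
        })
        .collect();

    // One output partition per file; the physical optimizer can still
    // repartition or coalesce afterwards.
    Arc::new(UnionExec::new(children))
}
```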
Describe alternatives you've considered
The current approach could remain
Additional context
I'm trying to take a more holistic view on what the parquet interface upstream should look like, which is somewhat related to this apache/arrow-rs#1473
FYI @rdettai @yjshen @alamb