
Single File Per ParquetExec, AvroExec, etc... #2293

Open
tustvold opened this issue Apr 20, 2022 · 4 comments
Labels
enhancement New feature or request help wanted Extra attention is needed

Comments

@tustvold
Contributor

tustvold commented Apr 20, 2022

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

Part of #2079

Following on from #2292 and #2291 it should be possible to pull the multi-file handling out of each individual file operator, and delegate it to the physical plan. As described in #2079 this will greatly simplify the implementations, whilst also hiding fewer details from the physical plan.

Describe the solution you'd like

Currently a FileScanConfig would result in ListingTable::scan generating a physical plan that looks something like

ParquetExec

I propose instead generating something like

UnionExec
  ProjectionExec: ... // Partition 1
    UnionExec
        ParquetExec: ... // Partition 1 File 1
        ParquetExec: ... // Partition 1 File 2
  ProjectionExec: ... // Partition 2
    UnionExec
        ParquetExec: ... // Partition 2 File 1
        ParquetExec: ... // Partition 2 File 2
        ParquetExec: ... // Partition 2 File 3
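The proposed plan shape above can be sketched with a simplified, hypothetical model (these are stand-in types, not DataFusion's real ExecutionPlan API): one scan node per file, grouped per partition by a union, with a projection per partition where partition columns would be appended.

```rust
/// Hypothetical, simplified model of the proposed plan shape.
#[derive(Debug)]
enum Plan {
    /// Scans exactly one file (stands in for ParquetExec, AvroExec, ...)
    FileScan { path: String },
    /// Concatenates the output of its children
    Union(Vec<Plan>),
    /// Stands in for the per-partition ProjectionExec
    Projection(Box<Plan>),
}

/// Build the proposed plan: a top-level UnionExec of per-partition
/// ProjectionExecs, each wrapping a UnionExec of single-file scan nodes.
fn plan_for(partitions: &[Vec<&str>]) -> Plan {
    Plan::Union(
        partitions
            .iter()
            .map(|files| {
                Plan::Projection(Box::new(Plan::Union(
                    files
                        .iter()
                        .map(|f| Plan::FileScan { path: f.to_string() })
                        .collect(),
                )))
            })
            .collect(),
    )
}

fn main() {
    // Two partitions: one with two files, one with three, matching the
    // example plan in this issue.
    let plan = plan_for(&[
        vec!["p1/f1", "p1/f2"],
        vec!["p2/f1", "p2/f2", "p2/f3"],
    ]);
    println!("{plan:#?}");
}
```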

Whilst this plan is more complex, it results in less complexity in the file format operators, and should hopefully lead to fewer bugs like #2170 or #2000

Describe alternatives you've considered

We could not do this

FYI @thinkharderdev @matthewmturner

@tustvold tustvold added enhancement New feature or request help wanted Extra attention is needed labels Apr 20, 2022
@yjshen
Member

yjshen commented Apr 21, 2022

Do you mean PartitionedFile for File, but removing the partition_values field?

pub struct PartitionedFile {
    /// Path for the file (e.g. URL, filesystem path, etc)
    pub file_meta: FileMeta,
    /// Values of partition columns to be appended to each row
    pub partition_values: Vec<ScalarValue>,
    /// An optional file range for a more fine-grained parallel execution
    pub range: Option<FileRange>,
}

Another question: given that we do a one-to-one mapping between the Spark physical plan and the DataFusion physical plan, and serde the plan with proto, would it be possible to do this Original ParquetExec -> UnionExec(SchemaAdapterExec(New ParquetExec)) translation in ParquetExec's try_new method, or somewhere related in the physical plan?

@tustvold
Contributor Author

Do you mean PartitionedFile for File, but removing the partition_values field?

Yes, although removing the partition_values is likely follow-up work

in ParquetExec's try_new method or somewhere related place in the physical plan?

I would rather keep the translation logic out of the file format specific operators, but having a free function that can be called by ListingTable and potentially other things, such as your Spark translation layer, seems perfectly sensible to me. I just care about reducing the amount of smarts in the individual file format specific operators 😄
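The decoupling described here could look something like the following sketch: a free function that builds the per-file plan, callable from ListingTable or an external translation layer, so the file-format operators themselves stay single-file. All names are hypothetical stand-ins, not DataFusion's actual types.

```rust
/// Stand-in for a file-format-specific operator that reads exactly one file.
struct SingleFileExec {
    path: String,
}

/// Stand-in for the combined plan the free function would return.
struct MultiFilePlan {
    per_partition: Vec<Vec<SingleFileExec>>,
}

/// The "free function": groups one scan node per file under each partition.
/// Multi-file handling lives here, not inside SingleFileExec, so any caller
/// (ListingTable, a Spark plan converter, ...) gets the same translation.
fn scan_plan(file_groups: Vec<Vec<String>>) -> MultiFilePlan {
    MultiFilePlan {
        per_partition: file_groups
            .into_iter()
            .map(|group| {
                group
                    .into_iter()
                    .map(|path| SingleFileExec { path })
                    .collect()
            })
            .collect(),
    }
}

fn main() {
    let plan = scan_plan(vec![
        vec!["p1/f1".to_string()],
        vec!["p2/f1".to_string(), "p2/f2".to_string()],
    ]);
    println!("partitions: {}", plan.per_partition.len());
}
```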

@yjshen
Member

yjshen commented Apr 21, 2022

having a free function that can be called by ListingTable and potentially other things, such as your Spark translation layer, seems perfectly sensible to me

Sounds great!

@thinkharderdev
Contributor

This sounds like a great idea. The serial file processing in ParquetExec is currently a pretty nasty bottleneck for queries over large numbers of parquet files.
