Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sort filenames when reading parquet to ensure consistent schema #6629

Merged
merged 12 commits into from
Dec 11, 2023

Conversation

thomas-k-cameron
Copy link
Contributor

@thomas-k-cameron thomas-k-cameron commented Jun 10, 2023

Which issue does this PR close?

Closes #5779.

Rationale for this change

Sorting is passed to ensure that files returned is in the same order when the OS is different.

What changes are included in this PR?

This PR adds a lines of code to FileFormat::infer_schema to ParquetFormat; It sorts parquet file by it's location.

On test_util::store_parquet is changed to sort temp file by it's path and and write batches in the sorted order.

This change was introduced as it made the some tests, such as datasource::file_format::parquet::tests::read_merged_batches to fail.

It appears that these tests assumed that infer_schema processed the files in the same order that store_parquet's returning value files.

Are these changes tested?

Yes. This PR introduces new test case async fn issue_5779().

Are there any user-facing changes?

The order of Fields may change.

@github-actions github-actions bot added the core Core DataFusion crate label Jun 10, 2023
@alamb
Copy link
Contributor

alamb commented Jun 13, 2023

Thanks @thomas-k-cameron -- given my reading of #5779 it seems like perhaps a more direct solution might be to sort the list of files so they are always processed in the same order

Perhaps somehow figure out how to sort the objects by the location field, for example?

        let schemas: Vec<_> = futures::stream::iter(objects)
            .map(|object| fetch_schema(store.as_ref(), object, self.metadata_size_hint))
            .boxed() // Workaround https://github.com/rust-lang/rust/issues/64552
            .buffered(SCHEMA_INFERENCE_CONCURRENCY)
            .try_collect()
            .await?;

@thomas-k-cameron
Copy link
Contributor Author

Thank you for your advice.

I couldn't figure out what I was getting it wrong.
Let me try it out.

@tustvold tustvold changed the base branch from main to backup-main June 27, 2023 11:15
@tustvold tustvold changed the base branch from backup-main to main June 27, 2023 11:15
@thomas-k-cameron thomas-k-cameron force-pushed the filesystem-5779 branch 2 times, most recently from 1cf246b to 2d0f9bb Compare June 28, 2023 17:39
@thomas-k-cameron
Copy link
Contributor Author

Let me give you an update.

I tried to sort by location, but it hasn't been working.
I end up getting error from tests for fetching statistics.

It seems like this is coming from how parquet files for tests are created with random names.

I have few other ideas. I will see if it works out.

@thomas-k-cameron thomas-k-cameron force-pushed the filesystem-5779 branch 2 times, most recently from da1c922 to ac00cfa Compare December 10, 2023 04:52
@thomas-k-cameron
Copy link
Contributor Author

thomas-k-cameron commented Dec 10, 2023

@alamb
It's been a while but I believe it is ready for review.

I updated the PR comment to reflect my changes.

@thomas-k-cameron thomas-k-cameron marked this pull request as ready for review December 11, 2023 03:12
@alamb alamb changed the title Sort fields of schema Sort filenames when reading parquet to ensure consistent schema Dec 11, 2023
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @thomas-k-cameron -- I think this looks great. The only potential downside of this change I can think of is that this will change behavior for any object store that lists files sorted deterministicially in a different order than sort()

However, I think most object stores sort their locations lexographically so this PR is likely to simply make local filesystems consistent.

Thanks again

@thomas-k-cameron
Copy link
Contributor Author

thomas-k-cameron commented Dec 11, 2023

@alamb

Thank you very much for your feedback and approval.

Change that I just pushed simply applies a cargo fmt to the codebase; I think all tests should pass now.

@alamb
Copy link
Contributor

alamb commented Dec 11, 2023

Thanks @thomas-k-cameron -- once the check pass I'll plan to merge this PR.

@alamb alamb merged commit 391f301 into apache:main Dec 11, 2023
22 checks passed
@alamb
Copy link
Contributor

alamb commented Dec 11, 2023

🚀

@thomas-k-cameron thomas-k-cameron deleted the filesystem-5779 branch December 11, 2023 15:43
appletreeisyellow pushed a commit to appletreeisyellow/datafusion that referenced this pull request Dec 15, 2023
…he#6629)

* update

* FIXed

* add  parquet

* update

* update

* update

* try2

* update

* update

* Add comments

* cargo fmt

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
appletreeisyellow pushed a commit to appletreeisyellow/datafusion that referenced this pull request Jan 3, 2024
…he#6629)

* update

* FIXed

* add  parquet

* update

* update

* update

* try2

* update

* update

* Add comments

* cargo fmt

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
appletreeisyellow pushed a commit to appletreeisyellow/datafusion that referenced this pull request Jan 3, 2024
…he#6629)

* update

* FIXed

* add  parquet

* update

* update

* update

* try2

* update

* update

* Add comments

* cargo fmt

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Different schemas when inferring from local system in different OS
2 participants