Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Validate partitions columns in CREATE EXTERNAL TABLE if table already exists. #9912

Merged
merged 14 commits into from
Apr 5, 2024

Conversation

MohamedAbdeen21
Copy link
Contributor

Which issue does this PR close?

Closes #9785 .

Rationale for this change

Prevent specifying wrong partitions when creating external tables that already exist on disk. This stops a whole suite of problems with other statements (check the issue for examples) that expect the table partitions to match the path partitions.

What changes are included in this PR?

CREATE EXTERNAL TABLE now infers the partition columns provided in LOCATION and compares it with the columns provided in PARTITIONED BY clause. Partial partitions are allowed, and does nothing if path is empty.

Are these changes tested?

Only through slt tests, no unit tests.

Are there any user-facing changes?

Better errors and safeguards when dealing with partitioned external tables.

@github-actions github-actions bot added core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) labels Apr 2, 2024
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @MohamedAbdeen21 -- this looks really nice and will be a better user experience.

I think there are a few more tests to write but otherwise this is looking quite good.

cc @devinjdangelo and @metesynnada FYI

}

/// Infer the partitioning at the given path on the provided object store.
/// For performance reasons, it doesn't read all the files on disk
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💯

datafusion/core/src/datasource/listing/table.rs Outdated Show resolved Hide resolved
@alamb
Copy link
Contributor

alamb commented Apr 4, 2024

Marking this as a draft as I think we are just waiting on some more tests. @MohamedAbdeen21 please let me know if that was not correct or if you need some help finishing this PR up.

@alamb alamb marked this pull request as draft April 4, 2024 19:07
@MohamedAbdeen21
Copy link
Contributor Author

Hi @alamb, I've implemented the check we discussed. Apart from adding unit tests for the two new functions, I believe the PR is now adequately covered.

I'm curious to see if anyone can come up with additional queries that should be tested.

@alamb alamb marked this pull request as ready for review April 4, 2024 20:22
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you so much @MohamedAbdeen21 -- this looks awesome

.rev()
.skip(1) // get parents only; skip the file itself
.rev()
.map(|s| s.split('=').take(1).collect())
Copy link
Contributor

@Jefffrey Jefffrey Apr 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the partition column name containing a = a cheeky edge case we need to consider here?

Edit: related #9269 (comment)

Copy link
Contributor Author

@MohamedAbdeen21 MohamedAbdeen21 Apr 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do need to consider this. Now that you mention it, I realize that I never actually thought about how these column names are written to disk. I'll have to take a look on how other systems do it before suggesting anything.

After looking at the linked issue, I think this should be addressed in a follow-up since it may require changing other parts of DF first.

@alamb
Copy link
Contributor

alamb commented Apr 5, 2024

Thanks for the review @Jefffrey -- @MohamedAbdeen21 please let me know if you would like to address @Jefffrey 's comments i this PR or a follow on one.

@MohamedAbdeen21
Copy link
Contributor Author

MohamedAbdeen21 commented Apr 5, 2024

I'm not 100% sure.

I've never thought about the cases mentioned in #9269 before to be honest. Maybe do follow-ups since it looks like we may need other changes around sqlparser and DF first.

@Jefffrey
Copy link
Contributor

Jefffrey commented Apr 5, 2024

I'm not 100% sure.

I've never thought about the cases mentioned in #9269 before to be honest. Maybe do follow-ups since it looks like we may need other changes around sqlparser and DF first.

Yes this sounds good 👍

@alamb alamb merged commit a29f2be into apache:main Apr 5, 2024
24 checks passed
@alamb
Copy link
Contributor

alamb commented Apr 5, 2024

Thanks again @MohamedAbdeen21 and @Jefffrey -- each PR gets us that much better 🙏

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] Panic when querying table with wrong partition columns order
3 participants