Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Make ListingTableUrl allow direct construction #12981

Draft
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

OussamaSaoudi
Copy link
Contributor

@OussamaSaoudi OussamaSaoudi commented Oct 17, 2024

Which issue does this PR close?

Closes #12581

Rationale for this change

Users of datafusion may have object_store paths. However, the current read_parquet forces users to format their paths as a string only to be re-parsed into the original paths. Here I provide a way for datafusion users to simply path an object_store path that they want to use.

What changes are included in this PR?

This PR allows users to construct ListingTableUrl directly with a Path instead of getting the path by parsing the Url path. This is useful when the user already has an object_store::path::Path, and doesn't want to convert their path to string only for it to be parsed again in try_new.

This PR also changes the impl<P> for DataFilePaths where P: AsRef<str>. Previously, passing a Vec<ListingTableUrl> to read_parquet would convert the ListingTableUrl to &str and parse it back to Vec<ListingTableUrl>. This causes issues with escape characters. Instead, I split the trait impl into cases for Vec<&str> and Vec<ListingTableUrl>.

Are these changes tested?

I add one test to make sure that passing Vec<ListingTableUrl> works as expected, and does not cause errors due to escape characters.

Are there any user-facing changes?

@github-actions github-actions bot added the core Core DataFusion crate label Oct 17, 2024
@OussamaSaoudi OussamaSaoudi changed the title Listing url constructor Make ListingTableUrl allow direct construction Oct 17, 2024
@OussamaSaoudi OussamaSaoudi changed the title Make ListingTableUrl allow direct construction [WIP] Make ListingTableUrl allow direct construction Oct 17, 2024
@OussamaSaoudi OussamaSaoudi marked this pull request as ready for review October 24, 2024 07:25
@OussamaSaoudi OussamaSaoudi changed the title [WIP] Make ListingTableUrl allow direct construction Make ListingTableUrl allow direct construction Oct 24, 2024
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @OussamaSaoudi -- this looks great. I haev a small concern about this causing a regression for certain users. Let me know what you think

where
P: AsRef<str>,
{
impl DataFilePaths for Vec<&str> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the previous implementation would have allowed Vec<String> (and anything else) that implemented AsRef<str> as well and after this change Vec<String>

I wonder if we could change the signature to something like

impl<P> DataFilePaths for Vec<P>
where
    P: Into<ListingTableUrl>,

Which should still work for anything that allows AsRef<str> but also allow Vec<ListingTableUrl> 🤔

Copy link
Contributor Author

@OussamaSaoudi OussamaSaoudi Oct 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm I think this may fail since AsRef<str> doesn't implyInto<ListingTableUrl>. It has to go through a ListingTableUrl::parse, which is fallible.

I'll give it some more thought and play with the types. Perhaps I'll try to change how SessionContext::read_parquet handles AsRef<str> and ListingTableUrl.

Setting back to draft for now

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

@OussamaSaoudi OussamaSaoudi Oct 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure there's a clean way to implement this change without a breaking existing code. Here are a few things I considered:

  • I briefly considered changing the as_str impl of ListingTableChanges to make it succeed with a parse, but this is both a breaking change and pretty silly.
  • I tried using a blanket implement using TryInto, and use ListingTableUrl::parse as a TryInto<ListingTableUrl> impl. However I hit some issues:
impl<P> DataFilePaths for Vec<P>
where P: TryInto<ListingTableUrl, Error=DataFusionError>

The problem is the associated type for TryInto. A single type (exListingTableUrl) can implement TryInto<ListingTableUrl, Error=Infallible> , or TryInto<ListingTableUrl, Error=DataFusionError> (ex: through parse). This of course leads to conflicting types for my blanket impl approach.

  • I looked through SessionContext::read_parquet. We can't change the use of DataFilePaths::to_urls() since that's baked into the API:
    pub async fn read_parquet<P: DataFilePaths>(
        &self,
        table_paths: P,
        options: ParquetReadOptions<'_>,

I would've preferred to separate the concerns of parsing and fetching parquet. An approach that parses strings into ListingTableUrl before calling read_parquet would be my preferred solution.

If DataFusion is not looking for breaking changes (understandably), I can go ahead and close this PR :)

@OussamaSaoudi OussamaSaoudi changed the title Make ListingTableUrl allow direct construction [WIP] Make ListingTableUrl allow direct construction Oct 25, 2024
@OussamaSaoudi OussamaSaoudi marked this pull request as draft October 25, 2024 07:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ListingTableUrl should allow direct construction
2 participants