Allow user to use glob/wildcard in file path #2393

Closed
timvw opened this issue Apr 30, 2022 · 9 comments · Fixed by #2394
Labels
enhancement New feature or request

Comments

@timvw
Contributor

timvw commented Apr 30, 2022

Currently it is only possible to use a fully qualified path to a file or a folder, e.g.:

  • /Users/timvw/src/github/arrow-datafusion/testing/data/wildcard/green_01.csv
  • /Users/timvw/src/github/arrow-datafusion/testing/data/wildcard

When a folder contains several kinds of files, it would be handy to use a glob/wildcard to list only the relevant ones, e.g.:

  • /Users/timvw/src/github/arrow-datafusion/testing/data/wildcard/green_*.csv
    (This would allow one to exclude files such as red_01.csv in that same folder)

Currently this does not seem to work because of flawed logic in local.rs: list_all calls tokio::fs::metadata(&prefix).await?, which returns an error when the prefix contains a wildcard, since no file with that literal name exists.
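
A minimal sketch of that failure mode, using std::fs::metadata instead of tokio::fs::metadata to stay dependency-free. The helper names has_glob_chars and stat_prefix are hypothetical and are not DataFusion APIs; the set of metacharacters checked is an assumption.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Returns true if the path contains a character commonly treated as a glob
/// metacharacter. Hypothetical helper, not a DataFusion API.
fn has_glob_chars(path: &str) -> bool {
    path.contains(|c: char| matches!(c, '*' | '?' | '['))
}

/// Mimics a listing routine that stats the prefix before listing it.
/// The wildcard is taken literally, so a path such as `dir/green_*.csv`
/// normally names no existing file and the call errors out early.
/// Returns Ok(true) for a directory and Ok(false) for a regular file.
fn stat_prefix(prefix: &Path) -> io::Result<bool> {
    let meta = fs::metadata(prefix)?;
    Ok(meta.is_dir())
}
```

A caller could check has_glob_chars first and only stat the part of the path that precedes the first metacharacter.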

@timvw timvw added the enhancement New feature or request label Apr 30, 2022
@tustvold
Contributor

tustvold commented May 1, 2022

I'm not sure how I feel about having the local object store treat paths differently from the other implementations. Perhaps we should consistently support glob expressions as part of the ObjectStore trait, or not? Having a mix seems unfortunate...

FYI @matthewmturner @alamb

@timvw
Contributor Author

timvw commented May 1, 2022

The thing is, my belief is that most object stores already support globbing (otherwise, yes, making it explicit on the trait would have been valuable).

E.g. in datafusion-objectstore-s3, when I modify a test in s3.rs to use a glob instead of a file name, it keeps passing:

    let mut files = s3_file_system
        .list_file("data/alltypes_plain.sn*py.parquet") // line 379
        .await?;

The same holds for datafusion-objectstore-azure:

    let mut files = azure_file_system
        .list_file("parquet-testing-data/alltypes_plai*.snappy.parquet") // line 283
        .await?;

Todos:

  • Consider making globbing explicit on the ObjectStore trait?
  • Add tests in all object stores to verify that globbing works as expected
  • Make globbing also work on ListingTable

What do you think?

@tustvold
Contributor

tustvold commented May 2, 2022

Does this work correctly with multiple files, or wildcards in the path and not the file? I'm very surprised to see this working.

I ask as it is just calling out to https://docs.aws.amazon.com/AmazonS3/latest/API/API_ListObjectsV2.html which doesn't support wildcards, let alone glob expressions. S3 doesn't even have a defined concept of a directory, so globbing has to be implemented client side.

Edit: in fact the S3 tests are written in such a way that they don't fail if the list returns no results. Is this what is going on here? Perhaps we should fix them to assert that the correct results are returned?
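
To make the point about client-side globbing concrete, here is a rough sketch under stated assumptions: the store can only list keys by a literal prefix, so the longest wildcard-free prefix of the pattern is sent to the store and the actual matching has to happen afterwards on the client. Both literal_prefix and list_by_prefix are hypothetical names; list_by_prefix is an in-memory stand-in for a prefix-only listing call, not an S3 client.

```rust
/// Longest leading part of `pattern` that contains no glob metacharacter.
/// Hypothetical helper; the metacharacter set (`*`, `?`, `[`) is an assumption.
fn literal_prefix(pattern: &str) -> &str {
    match pattern.find(|c: char| matches!(c, '*' | '?' | '[')) {
        Some(idx) => &pattern[..idx],
        None => pattern,
    }
}

/// In-memory stand-in for a prefix-only listing call: returns the keys that
/// start with `prefix`, treating every character of `prefix` literally.
fn list_by_prefix<'a>(keys: &[&'a str], prefix: &str) -> Vec<&'a str> {
    keys.iter()
        .copied()
        .filter(|k| k.starts_with(prefix))
        .collect()
}
```

Passing the unmodified pattern to list_by_prefix yields an empty list, because no key contains a literal `*`. That matches the behaviour suspected above: the listing returns nothing and a test without a result assertion still passes.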

@timvw
Contributor Author

timvw commented May 2, 2022

@tustvold: Good remarks/questions!

This is what I can do:

  • have a look at the tests (s3 and azure) and verify that they fail when they should
  • verify how globbing could work on local, s3, azure and hdfs

@timvw
Contributor Author

timvw commented May 2, 2022

Can confirm that globbing indeed does not work OOTB on azure and s3.

Added additional assertions to the respective projects to ensure that at least one file is processed.
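
As an illustration of why such an assertion matters, here is a small sketch. It assumes the original tests only checked properties of each returned file inside a loop; the actual test code is not shown in this thread, and both function names are hypothetical.

```rust
/// Loop-style check: every listed file must look like a parquet file.
/// It passes vacuously when `files` is empty.
fn all_are_parquet(files: &[&str]) -> bool {
    files.iter().all(|f| f.ends_with(".parquet"))
}

/// Stricter check: additionally requires that at least one file was listed,
/// so an empty listing is reported as a failure.
fn non_empty_and_parquet(files: &[&str]) -> bool {
    !files.is_empty() && all_are_parquet(files)
}
```

With an empty listing, all_are_parquet returns true while non_empty_and_parquet returns false.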

@alamb
Contributor

alamb commented May 2, 2022

@timvw -- I wonder if you could do the "globbing" at a higher level (i.e. through what interface would the user type in /Users/timvw/src/github/arrow-datafusion/testing/data/wildcard/green_*.csv)?

So rather than passing /Users/timvw/src/github/arrow-datafusion/testing/data/wildcard/green_*.csv to datafusion directly, you can implement something that uses the object store's list_all feature to implement whatever globbing semantics you wanted, and then pass the fully resolved list of files to datafusion?

I am not sure if this would work for your use case.

@timvw
Contributor Author

timvw commented May 2, 2022

@alamb My preference is not to push this to the user, but to have it available in datafusion itself (the expectation is behaviour very similar to Apache Spark).

I will have a look at Hadoop fs for inspiration on how they implemented this across filesystem implementations, and potentially come back with a proposal ;)

@timvw
Contributor Author

timvw commented May 2, 2022

Currently Spark does the following:
Datasource (paths) -> (when globPaths option is true) -> checkAndGlobPathIfNecessary

  • when no glob pattern in path -> fs.listfiles(path)
  • when glob pattern in path -> globber.glob(fs.listfiles(path))

As @tustvold already suggested, adding a glob_files method to ObjectStore seems the appropriate way to implement this feature. This method should then:

  • when no glob pattern in path -> simply list_files
  • when glob pattern in path -> glob (list_files) (can implement this with file_stream.filter, similar to existing list_file_with_suffix)
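
The two cases above could be sketched roughly as follows. This is not the code from #2394: it filters an in-memory slice instead of a file stream, and glob_match is a deliberately small, hypothetical matcher (supporting `*`, `?` and `[...]` sets or ranges) standing in for a real glob library. Unlike most glob implementations, `*` here also matches `/`.

```rust
/// Matches `text` against a small glob dialect: `*` (any run of characters),
/// `?` (any single character) and `[abc]` / `[a-c]` (one character from a set).
fn glob_match(pattern: &str, text: &str) -> bool {
    let p: Vec<char> = pattern.chars().collect();
    let t: Vec<char> = text.chars().collect();
    match_from(&p, &t)
}

/// Recursive backtracking matcher over character slices.
fn match_from(p: &[char], t: &[char]) -> bool {
    match p.first().copied() {
        None => t.is_empty(),
        // Try every possible length for the run consumed by `*`.
        Some('*') => (0..=t.len()).any(|skip| match_from(&p[1..], &t[skip..])),
        Some('?') => !t.is_empty() && match_from(&p[1..], &t[1..]),
        Some('[') => match p.iter().position(|&c| c == ']') {
            Some(close) if !t.is_empty() => {
                class_contains(&p[1..close], t[0]) && match_from(&p[close + 1..], &t[1..])
            }
            // An unterminated `[` (or exhausted text) never matches in this sketch.
            _ => false,
        },
        Some(c) => !t.is_empty() && t[0] == c && match_from(&p[1..], &t[1..]),
    }
}

/// Checks whether `c` is in a character class body such as `345` or `a-c`.
fn class_contains(set: &[char], c: char) -> bool {
    let mut i = 0;
    while i < set.len() {
        if i + 2 < set.len() && set[i + 1] == '-' {
            if set[i] <= c && c <= set[i + 2] {
                return true;
            }
            i += 3;
        } else {
            if set[i] == c {
                return true;
            }
            i += 1;
        }
    }
    false
}

/// No glob pattern in `path`: plain prefix listing.
/// Glob pattern in `path`: filter the listing through the matcher.
fn glob_files<'a>(all_files: &[&'a str], path: &str) -> Vec<&'a str> {
    let is_glob = path.contains(|c: char| matches!(c, '*' | '?' | '['));
    all_files
        .iter()
        .copied()
        .filter(|f| if is_glob { glob_match(path, f) } else { f.starts_with(path) })
        .collect()
}
```

Keeping the non-glob branch as a plain prefix listing means existing callers that pass a file or folder path see no change in behaviour.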

Reworked my code in #2394 to conform with the above.

@timvw
Contributor Author

timvw commented May 3, 2022

With the current code my use case works:

    use datafusion::prelude::*;

    let ctx = SessionContext::new();
    let nycdata = "/Users/timvw/nyc/trip data";
    // The character class [345] matches the files ending in -03, -04 and -05.
    let yellow = format!("{}/yellow_tripdata_2018-0[345].csv", nycdata);
    let options = CsvReadOptions::new();
    let df = ctx.read_csv(yellow, options).await?;


3 participants