Allow user to use glob/wildcard in file path #2393
Comments
I'm not sure how I feel about having the local object store treat paths differently from the other implementations. Perhaps we should consistently support glob expressions as part of the ObjectStore trait, or not? Having a mix seems unfortunate... |
The thing is, my belief is that most object stores already support globbing (otherwise, yes, making it explicit on the trait would have been valuable). E.g. in datafusion-objectstore-s3, when I modify a test in s3.rs to use globbing instead of a filename, it keeps working:
same for datafusion-objectstore-azure
Todos:
What do you think? |
Does this work correctly with multiple files, or with wildcards in the path rather than the file name? I'm very surprised to see this working. I ask because it just calls out to https://docs.aws.amazon.com/AmazonS3/latest/API/API_ListObjectsV2.html, which doesn't support wildcards, let alone glob expressions. S3 doesn't even have a defined concept of a directory, so globbing has to be implemented client side. Edit: in fact the S3 tests are written in such a way that they don't fail if the list returns no results. Is this what is going on here? Perhaps we should fix them to assert the correct results are returned? |
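To illustrate the point about prefix listings, here is a minimal sketch using a hypothetical in-memory `FlatStore` (not the real S3 client or any datafusion type). It assumes a listing call that only does literal prefix matching, which is how the comment above describes ListObjectsV2. Passing a glob pattern as the prefix returns nothing, so a test that does not assert on the results passes vacuously. Listing by the literal part of the pattern returns a superset that still has to be filtered client side.

```rust
/// Hypothetical in-memory stand-in for a flat key space such as a bucket.
struct FlatStore {
    keys: Vec<String>,
}

impl FlatStore {
    /// Mimics a prefix listing: returns every key that starts with `prefix`.
    /// Wildcard characters get no special treatment.
    fn list(&self, prefix: &str) -> Vec<&str> {
        self.keys
            .iter()
            .map(String::as_str)
            .filter(|k| k.starts_with(prefix))
            .collect()
    }
}

/// Returns the part of `pattern` before the first glob metacharacter.
fn literal_prefix(pattern: &str) -> &str {
    match pattern.find(|c: char| matches!(c, '*' | '?' | '[')) {
        Some(idx) => &pattern[..idx],
        None => pattern,
    }
}

fn main() {
    let store = FlatStore {
        keys: vec![
            "data/yellow_2018-01.csv".to_string(),
            "data/yellow_2018-03.csv".to_string(),
            "data/red_01.csv".to_string(),
        ],
    };
    let pattern = "data/yellow_2018-0[345].csv";

    // The pattern taken literally is not a prefix of any key.
    assert!(store.list(pattern).is_empty());

    // The literal prefix yields candidates, including one that the
    // glob would reject (2018-01), so a client-side filter is still needed.
    let candidates = store.list(literal_prefix(pattern));
    assert_eq!(candidates.len(), 2);
    println!("candidates before filtering: {:?}", candidates);
}
```

A test that asserts on the number of returned files, rather than only iterating over them, would have caught the empty listing.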
@tustvold : Good remarks/questions! This is what I can do:
|
Can confirm that globbing indeed does not work out of the box on Azure and S3. Added additional assertions to the respective projects to ensure that at least one file is processed |
timvm -- I wonder if you could do the "globbing" at a higher level (aka what interface would the user decide to type in So rather than passing I am not sure if this would work for your usecase |
@alamb My preference is to not push this to the user, but to have it available in datafusion itself (the expectation is behaviour very similar to Apache Spark). I will have a look at Hadoop FS for inspiration on how they implemented this across filesystem implementations, and potentially come back with a proposal ;) |
Currently Spark does the following: As @tustvold already suggested, adding a glob_files method to ObjectStore seems the appropriate way to implement this feature. This method should then: Reworked my code in #2394 to conform with the above. |
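As a rough sketch of what a `glob_files` method with a shared default implementation could look like: the trait below is a hypothetical, synchronous simplification (`ListStore`, `list_file`, and `MemStore` are made-up names, and the real ObjectStore trait is async), and the matcher is a hand-written one that only handles `*`, `?`, and simple `[abc]` sets. It is not the code from #2394.

```rust
/// Minimal glob matcher: `*`, `?`, and `[abc]` sets (no ranges, no negation).
/// Unlike shell globbing, `*` here also matches `/`.
fn glob_match(pat: &[char], text: &[char]) -> bool {
    match pat.split_first() {
        None => text.is_empty(),
        // `*` matches any run of characters, including the empty run.
        Some((&'*', rest)) => (0..=text.len()).any(|i| glob_match(rest, &text[i..])),
        Some((&'?', rest)) => !text.is_empty() && glob_match(rest, &text[1..]),
        Some((&'[', rest)) => match rest.iter().position(|&c| c == ']') {
            Some(close) => match text.split_first() {
                Some((c, tail)) => {
                    rest[..close].contains(c) && glob_match(&rest[close + 1..], tail)
                }
                None => false,
            },
            None => false, // unterminated set never matches
        },
        Some((&c, rest)) => match text.split_first() {
            Some((&t, tail)) => t == c && glob_match(rest, tail),
            None => false,
        },
    }
}

/// Hypothetical, synchronous simplification of an object store interface.
trait ListStore {
    /// Lists every key starting with `prefix` (no wildcard support).
    fn list_file(&self, prefix: &str) -> Vec<String>;

    /// Default glob support: list by the literal prefix, then filter client side.
    fn glob_files(&self, pattern: &str) -> Vec<String> {
        let end = pattern
            .find(|c: char| matches!(c, '*' | '?' | '['))
            .unwrap_or(pattern.len());
        let pat: Vec<char> = pattern.chars().collect();
        self.list_file(&pattern[..end])
            .into_iter()
            .filter(|key| {
                let text: Vec<char> = key.chars().collect();
                glob_match(&pat, &text)
            })
            .collect()
    }
}

/// In-memory store used only to exercise the default method.
struct MemStore {
    keys: Vec<String>,
}

impl ListStore for MemStore {
    fn list_file(&self, prefix: &str) -> Vec<String> {
        self.keys.iter().filter(|k| k.starts_with(prefix)).cloned().collect()
    }
}

fn main() {
    let store = MemStore {
        keys: ["nyc/yellow_tripdata_2018-01.csv", "nyc/yellow_tripdata_2018-03.csv",
               "nyc/yellow_tripdata_2018-04.csv", "nyc/red_01.csv"]
            .iter().map(|s| s.to_string()).collect(),
    };
    let hits = store.glob_files("nyc/yellow_tripdata_2018-0[345].csv");
    assert_eq!(hits, vec!["nyc/yellow_tripdata_2018-03.csv", "nyc/yellow_tripdata_2018-04.csv"]);
}
```

Putting the filtering in a default method means each store only has to provide prefix listing, and all stores treat a given pattern the same way.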
With the current code my use-case works:
let ctx = SessionContext::new();
let nycdata = "/Users/timvw/nyc/trip data";
let yellow = format!("{}/yellow_tripdata_2018-0[345].csv", nycdata);
let options = CsvReadOptions::new();
let df = ctx.read_csv(yellow, options).await?; |
Currently it is only possible to use a fully qualified path to a file or a folder, eg:
In situations when a folder contains multiple "sorts" of files it would be handy to use a glob/wildcard to list relevant files, eg:
(This would allow one to exclude files such as red_01.csv in that same folder)
Currently this does not seem to work because of flawed logic in local.rs: list_all calls tokio::fs::metadata(&prefix).await?, which fails when the prefix contains a wildcard.
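The failure mode can be reproduced in isolation. The sketch below uses `std::fs::metadata` as a stand-in for `tokio::fs::metadata` (an assumption made so the example needs no external crates), and the file name and `stat_error` helper are hypothetical. Asking for metadata of a path that contains glob characters looks for a file with that literal name, so the call returns an error before any listing happens.

```rust
use std::fs;
use std::io::ErrorKind;
use std::path::Path;

/// Returns the kind of error produced when asking for metadata of `path`,
/// or `None` if a file or directory with exactly that name exists.
fn stat_error(path: &Path) -> Option<ErrorKind> {
    fs::metadata(path).err().map(|e| e.kind())
}

fn main() {
    // Hypothetical pattern: no file is literally named like this.
    let pattern = std::env::temp_dir().join("yellow_tripdata_2018-0[345].csv");

    match stat_error(&pattern) {
        // The exact error kind is platform dependent; on Unix-like
        // systems it is typically `NotFound`.
        Some(kind) => println!("metadata failed with {:?}", kind),
        None => println!("a file with that literal name exists"),
    }
}
```

Any fix therefore has to detect the wildcard and resolve it to concrete paths before calling `metadata`.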