-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Issue 2393: Support glob patterns for files #2394
Conversation
Resolved failing clippy, lint and cargo fmt errors |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the contribution @timvw! The changes look reasonable to me.
Not had time to look in detail, but I think the code now mixes blocking IO, as performed by glob, and non-blocking IO as performed by tokio::fs. I think we should consistently use one or the other... Perhaps we could use tokio::spawn_blocking if a runtime is available, and then use blocking IO within the closure (this is what tokio::fs does under the hood anyway). Perhaps checkout ParquetExec for an example of this? |
Had a look at the implementation of tokio fs operations, and copied their (non-public) asyncify method to run the globbing on... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Liking where this is headed 👍
Afaik this should be good to go now. |
Tests pass now on my windows machine as well :) |
Fixed clippy issues + documented the checks that are executed during a PR build (+ added script to execute them). |
Perhaps we could split out the formatting and CI changes into a separate PR, as I think they're someone parallel to the changes in this PR? |
…h_suffix list_file_with_suffix
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, thank you 👍, one minor nit which you can take or leave, will merge tomorrow unless anyone else weighs in.
Edit: also appears to be a merge conflict
data-access/src/object_store/mod.rs
Outdated
/// Filters the file_stream to only contain files that end with suffix | ||
fn filter_suffix(file_stream: FileMetaStream, suffix: &str) -> Result<FileMetaStream> { | ||
let suffix = suffix.to_owned(); | ||
Ok(Box::pin(file_stream.filter(move |fr| { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
try_filter would perhaps be nicer https://docs.rs/futures/latest/futures/stream/trait.TryStreamExt.html#method.try_filter
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the code that was already in place.
When I update the code to use try_filter, it will not pass-through the errors anymore (not sure if anyone was actually doing something with them though..)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rebased to handle the merge conflict
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it will not pass-through the errors anymore
It should do? "All errors are passed through without filtering in this combinator."
Co-authored-by: Raphael Taylor-Davies <1781103+tustvold@users.noreply.github.com>
Thanks again 👍 |
Which issue does this PR close?
Closes #2393.
What changes are included in this PR?
When the provided prefix is not a valid path (file/directory), we will now attempt to "glob" over the prefix.
Eg: when the path is
/testdata/blah.txt -> return /testdata/blah.txt (as before)
/testdata -> return all files that exist in /testdata (as before)
/testdata/x_z -> return all files that match the given pattern (assuming that there is no file/directory '/testdata/x_z')