SuccessFileSource: correctness for multi-dir globs #1470
- single glob call to save filesystem calls/RPCs
This change in behavior should probably be in its own PR and flagged for discussion, I think, as to how it should be ANN. Requiring more success files seems very safe; dropping the non-hidden paths 'weakens' the test (even if it's maybe more correct). It probably just deserves its own flagging/PR to see whether it should be in this trait or a new one.
The behavior is restored, IMO, hopefully just in a more efficient manner: I might need to add another test, to show that
@gerashegalov not sure if you saw my comment directly on the commit, but... sorry to do this to you: to keep the previous behaviours intact, however wonky, looking at this we will need 2 methods.
def globHasFiles(globPath: String, conf: Configuration, successFilter: Boolean, hiddenFilter: Boolean) and then call globHasFiles(..., true, true) from the one below. The only downside is that, by adding the hidden filter like this, we have changed the behavior of globHasSuccess to be the same as our SuccessFileSource trait. As logical as it sounds that they should be the same, they aren't.
@ianoc I added more tests to make the point that the original behavior is not changed after the previous commit gerashegalov@bb8bcd9, which addressed your first review. I only had to change one test. The existing test was incorrect, and was only succeeding because git does not support physically empty directories, so the directory 05 was missing in the test environment.
…ld remain filtering out hidden files for now
Awesome, thanks @gerashegalov
SuccessFileSource: correctness for multi-dir globs
SuccessFileSource currently processes incomplete paths if the glob resolves to multiple leaf directories and only some of them are committed.
Furthermore, it unnecessarily queries the underlying filesystem twice, differing only in the client-side path filter.
Git does not allow physically empty dirs, and therefore
scalding-core/src/test/resources/com/twitter/scalding/test_filesystem/test_data/2013/05
was missing and tested incorrectly
This PR proposes to accept a globbed path as long as all files' parent dirs contain _SUCCESS. This includes empty directories to allow for Hadoop's LazyOutputFormat.
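
The proposed rule can be illustrated with a minimal sketch. This is not the code from the PR: it models the result of a single glob call as a plain list of file path strings instead of Hadoop FileStatus objects, and the names SuccessCheckSketch, splitPath and allDirsCommitted are hypothetical. It assumes the listing includes hidden files such as _SUCCESS, so one listing is enough to decide.

```scala
object SuccessCheckSketch {
  // Hypothetical helper: split "a/b/c" into (parent "a/b", file name "c").
  private def splitPath(path: String): (String, String) = {
    val i = path.lastIndexOf('/')
    if (i < 0) ("", path) else (path.substring(0, i), path.substring(i + 1))
  }

  // `files` stands in for the result of ONE glob call that does not filter
  // out hidden files. The glob is accepted only if it matched something and
  // every parent directory that appears in the listing contains _SUCCESS.
  def allDirsCommitted(files: Seq[String]): Boolean = {
    val namesByDir: Map[String, Seq[String]] =
      files.map(splitPath).groupBy(_._1).map { case (dir, entries) =>
        dir -> entries.map(_._2)
      }
    namesByDir.nonEmpty && namesByDir.values.forall(_.contains("_SUCCESS"))
  }
}
```

In this model a directory holding only _SUCCESS and no data files still counts as committed, which mirrors the LazyOutputFormat case mentioned above, while a single uncommitted directory rejects the whole glob:

```scala
// 2013/04 has no data files but is committed, so the glob is accepted.
SuccessCheckSketch.allDirsCommitted(
  Seq("2013/03/part-00000", "2013/03/_SUCCESS", "2013/04/_SUCCESS"))

// 2013/05 has data but no _SUCCESS, so the glob is rejected.
SuccessCheckSketch.allDirsCommitted(
  Seq("2013/03/part-00000", "2013/03/_SUCCESS", "2013/05/part-00000"))
```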