Skip to content

Conversation

@gerashegalov
Copy link
Contributor

SuccessFileSource currently processes incomplete paths if the glob resolves to multiple leaf directories and only part of them are committed.

Furthermore it unnessarily queries the underlying filesystem twice just with different client-side path filters.

Git does not allow physically empty dirs, and therefore
scalding-core/src/test/resources/com/twitter/scalding/test_filesystem/test_data/2013/05
was missing and tested incorrectly

This PR proposes to accept a globbed path as long as all file's parent dirs contain _SUCCESS. This includes empty directories to allow for Hadoop's LazyOutputFormat.

- single glob call to save filesystem calls/RPC's
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change in behavior should probably be in its own PR and more flagged for discussion I think as to how it should be ANN. Requiring more successfiles seems very safe, dropping the non-hidden paths 'weakens' the test (even if its maybe more correct). Probably just deserves its own flagging/PR to see if it should be in this trait or a new one.

@gerashegalov
Copy link
Contributor Author

The behavior is restored, IMO, hopefully just in a more efficient manner:
https://github.com/twitter/scalding/pull/1470/files#diff-b45007d4ad3065818f4b473453d891d4R164

I might need to add another test, to show that

@ianoc
Copy link
Collaborator

ianoc commented Jan 14, 2016

@gerashegalov not sure if you saw my comment directly on the commit but...:::

sorry to do this to you, but to keep the previous behaviours intact, however wonky, looking at this we will need 2 methods.

  1. def globHasSuccessFile(globPath: String, conf: Configuration): Boolean = globHasFiles(globPath,conf, successFilter=true, hiddenFilter=False)

def globHasFiles(globPath: String, conf: Configuration, successFilter: Boolen, hiddenFilter: Boolean)

and then call the globHasFiles(...,true,true) from the one below.

The only downside here is just with the addition of the hidden like this we have changed the behavior of globHasSuccess to be the same as our SuccessFileSource trait...which as logical as it sounds they should be the same, arent....
Add a line note

@gerashegalov
Copy link
Contributor Author

@ianoc I added more tests to make the point that the original behavior is not changed after the previous commit gerashegalov@bb8bcd9 for addressing your first review. The only test I had to change
TestFileSource:

    "accept a single directory without glob" in {
      pathIsGood("test_data/2013/05/") shouldBe true
    }

The existing test was incorrect, and was only succeeding because git does not support physically empty directories, and the directory 05 was missing in the test environment.

…ld remain filtering out hidden files for now
@ianoc
Copy link
Collaborator

ianoc commented Jan 15, 2016

Awesome, thanks @gerashegalov

ianoc added a commit that referenced this pull request Jan 15, 2016
SuccessFileSource: correctness for multi-dir globs
@ianoc ianoc merged commit 2eb06d8 into twitter:develop Jan 15, 2016
@gerashegalov
Copy link
Contributor Author

Thanks for helpful review @ianoc and @johnynek

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants