-
Notifications
You must be signed in to change notification settings - Fork 308
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ADAM-463] AvroMultiParquetRDD loads multiple Parquet files #472
Conversation
Fixes #463 |
Looks like some binary files e.g. |
Yeah, we want to make sure we can load Parquet files directly -- and we're testing the childLocators() method of ClasspathFileLocator at the same time. |
fd61a03
to
db20564
Compare
Test FAILed. Build result: FAILUREGitHub pull request #472 of commit fd61a03 automatically merged.[EnvInject] - Loading node environment variables.Building remotely on amp-jenkins-slave-01 (centos) in workspace /home/jenkins/workspace/ADAM-prb > git rev-parse --is-inside-work-tree # timeout=10Fetching changes from the remote Git repository > git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > git --version # timeout=10 > git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ > git rev-parse origin/pr/472/merge^{commit} # timeout=10Checking out Revision 2a9921c (detached) > git config core.sparsecheckout # timeout=10 > git checkout -f 2a9921c > git rev-list a34074082d54668273cacb5528d8aaed1b9c6ea8 # timeout=10Triggering ADAM-prb » 2.3.0,centosTriggering ADAM-prb » 1.0.4,centosTriggering ADAM-prb » 2.2.0,centosADAM-prb » 2.3.0,centos completed with result SUCCESSADAM-prb » 1.0.4,centos completed with result FAILUREADAM-prb » 2.2.0,centos completed with result SUCCESSTest FAILed. |
GAH, HDFSFileLocator. |
Test FAILed. Build result: FAILUREGitHub pull request #472 of commit db20564 automatically merged.[EnvInject] - Loading node environment variables.Building remotely on amp-jenkins-slave-01 (centos) in workspace /home/jenkins/workspace/ADAM-prb > git rev-parse --is-inside-work-tree # timeout=10Fetching changes from the remote Git repository > git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > git --version # timeout=10 > git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ > git rev-parse origin/pr/472/merge^{commit} # timeout=10Checking out Revision 4f75a4a (detached) > git config core.sparsecheckout # timeout=10 > git checkout -f 4f75a4a > git rev-list a34074082d54668273cacb5528d8aaed1b9c6ea8 # timeout=10Triggering ADAM-prb » 2.3.0,centosTriggering ADAM-prb » 1.0.4,centosTriggering ADAM-prb » 2.2.0,centosADAM-prb » 2.3.0,centos completed with result SUCCESSADAM-prb » 1.0.4,centos completed with result FAILUREADAM-prb » 2.2.0,centos completed with result FAILURETest FAILed. |
db20564
to
3cc1afe
Compare
Test FAILed. Build result: FAILUREGitHub pull request #472 of commit 3cc1afe automatically merged.[EnvInject] - Loading node environment variables.Building remotely on amp-jenkins-slave-01 (centos) in workspace /home/jenkins/workspace/ADAM-prb > git rev-parse --is-inside-work-tree # timeout=10Fetching changes from the remote Git repository > git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > git --version # timeout=10 > git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ > git rev-parse origin/pr/472/merge^{commit} # timeout=10Checking out Revision d4ff591 (detached) > git config core.sparsecheckout # timeout=10 > git checkout -f d4ff591 > git rev-list 4f75a4a # timeout=10Triggering ADAM-prb » 2.3.0,centosTriggering ADAM-prb » 1.0.4,centosTriggering ADAM-prb » 2.2.0,centosADAM-prb » 2.3.0,centos completed with result FAILUREADAM-prb » 1.0.4,centos completed with result FAILUREADAM-prb » 2.2.0,centos completed with result FAILURETest FAILed. |
I'd prefer to open this against bdg-utils |
@@ -36,6 +36,7 @@ package org.bdgenomics.adam.io | |||
trait FileLocator extends Serializable { | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like this file could use some Scaladoc love.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm confused -- doesn't FileLocator have a long scaladoc? Or do you mean method-by-method docs? I'll add those.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I was going for method-by-method; thanks for adding!
3cc1afe
to
ed82e55
Compare
@fnothaft added some additional docs per your suggestion -- also added issues #476 and #477, and broke up the contents of the first commit to reflect changes that address specifically those issues. Just to be clear about what's being added here. Agree that this should go into bdg-utils, but maybe we can get it working and committed here, and then go make the changes over there too. |
Test FAILed. Build result: FAILUREGitHub pull request #472 of commit ed82e55 automatically merged.[EnvInject] - Loading node environment variables.Building remotely on amp-jenkins-slave-01 (centos) in workspace /home/jenkins/workspace/ADAM-prb > git rev-parse --is-inside-work-tree # timeout=10Fetching changes from the remote Git repository > git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > git --version # timeout=10 > git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ > git rev-parse origin/pr/472/merge^{commit} # timeout=10Checking out Revision cc683a5 (detached) > git config core.sparsecheckout # timeout=10 > git checkout -f cc683a5 > git rev-list d893364b0ad2ed754626d1fe1814062bde73bed3 # timeout=10Triggering ADAM-prb » 2.3.0,centosTriggering ADAM-prb » 1.0.4,centosTriggering ADAM-prb » 2.2.0,centosADAM-prb » 2.3.0,centos completed with result FAILUREADAM-prb » 1.0.4,centos completed with result FAILUREADAM-prb » 2.2.0,centos completed with result FAILURETest FAILed. |
This appears to be a Jenkins error again. |
ed82e55
to
e6786de
Compare
Test PASSed. |
Frank, I added some more docs, this has been rebased, and (thanks to Carl's fixes) now builds against all versions of Hadoop we test against. Personally, I think this should be merged before your other PR (#478) which removes a lot of these methods from adam-core; if you merge this, then I'll file a PR that contains the same changes to the corresponding classes back in bdg-utils. |
@tdanford OK, sure. Let's get this in, then move the code to bdg-utils. I'll review now. |
|
||
override def bytes: ByteAccess = new HTTPRangedByteAccess(uri, retries) | ||
|
||
/** | ||
* Unsupported in HTTPFileLocator. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As an aside, what's the use case for HTTP access in the end? Not really germane to this PR, but should be documented eventually.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The HTTP access was added mainly for testing purposes.
LGTM; the one thing I'd like added is a NetworkConnected test that tries to access a multi-Parquet file from S3. I'll put a file up for you in a short bit. |
Test PASSed. |
Frank, I'm not understanding this -- the tests pass on Jenkins, but not for me locally? What does it look like for you? |
Updated the FileLocator trait to include a childLocators method, which implements the ability to scan the children of a locator (in addition to finding the parent). This will be required for the AvroMultiParquetRDD we're planning to introduce. Unfortunately (see the comments) this isn't possible to implement in HTTPFileLocator without additional assumptions. Also filled in the implementation of some missing methods in HTTPFileLocator in this commit, and added comments to FileLocator overall.
HDFSFileLocator implements the FileLocator API, but uses the Hadoop interface to find files in an HDFS system. Includes the InputStreamByteAccess implementation, which is used internally.
AvroMultiParquetRDD takes a FileLocator which indicates the _parent_ directory (in HDFS, or S3) of a set of parquet files, and which loads _all_ row groups from all the parquet files that are immediate children of that FileLocator. (Here, "children" is defined w/r/t the childLocators method implementation in the corresponding FileLocator.) (Joint work with Carl Yeksigian.)
889ace2
to
8b0d214
Compare
Test FAILed. Build result: FAILUREGitHub pull request #472 of commit 8b0d214 automatically merged.[EnvInject] - Loading node environment variables.Building remotely on amp-jenkins-slave-01 (centos) in workspace /home/jenkins/workspace/ADAM-prb > git rev-parse --is-inside-work-tree # timeout=10Fetching changes from the remote Git repository > git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > git --version # timeout=10 > git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ > git rev-parse origin/pr/472/merge^{commit} # timeout=10Checking out Revision b1126ca (detached) > git config core.sparsecheckout # timeout=10 > git checkout -f b1126ca > git rev-list 7402be7 # timeout=10Triggering ADAM-prb » 2.3.0,centosTriggering ADAM-prb » 2.2.0,centosTriggering ADAM-prb » 1.0.4,centosADAM-prb » 2.3.0,centos completed with result FAILUREADAM-prb » 2.2.0,centos completed with result FAILUREADAM-prb » 1.0.4,centos completed with result FAILURETest FAILed. |
Jenkins, retest this please. |
Test FAILed. Build result: FAILUREGitHub pull request #472 of commit 8b0d214 automatically merged.[EnvInject] - Loading node environment variables.Building remotely on amp-jenkins-slave-01 (centos) in workspace /home/jenkins/workspace/ADAM-prb > git rev-parse --is-inside-work-tree # timeout=10Fetching changes from the remote Git repository > git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > git --version # timeout=10 > git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ > git rev-parse origin/pr/472/merge^{commit} # timeout=10Checking out Revision d2b6d8f (detached) > git config core.sparsecheckout # timeout=10 > git checkout -f d2b6d8f > git rev-list e85aea7 # timeout=10Triggering ADAM-prb » 2.3.0,centosTriggering ADAM-prb » 2.2.0,centosTriggering ADAM-prb » 1.0.4,centosADAM-prb » 2.3.0,centos completed with result FAILUREADAM-prb » 2.2.0,centos completed with result FAILUREADAM-prb » 1.0.4,centos completed with result FAILURETest FAILed. |
Closing this as after #478 this will need to be moved to the bdg-utils repo. |
[ADAM-463] Added AvroMultiParquetRDD
First, updated the FileLocator to include a childLocators method, and added its implementation in all the child FileLocator implementations.
Next, added an HDFSFileLocator implementation.
Finally, added AvroMultiParquetRDD, which takes a FileLocator which indicates the parent directory (in HDFS, or S3) of a set of parquet files, and which loads all row groups from all the parquet files that are immediate children of that FileLocator.
(Joint work with Carl Yeksigian.)