Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ADAM-463] AvroMultiParquetRDD loads multiple Parquet files #472

Closed
wants to merge 3 commits into from

Conversation

tdanford
Copy link
Contributor

[ADAM-463] Added AvroMultiParquetRDD

First, updated the FileLocator to include a childLocators method, and added its implementation in all the child FileLocator implementations.

Next, added an HDFSFileLocator implementation.

Finally, added AvroMultiParquetRDD, which takes a FileLocator which indicates the parent directory (in HDFS, or S3) of a set of parquet files, and which loads all row groups from all the parquet files that are immediate children of that FileLocator.

(Joint work with Carl Yeksigian.)

@tdanford tdanford changed the title Multi loader [ADAM-463] AvroMultiParquetRDD loads multiple Parquet files Nov 12, 2014
@tdanford
Copy link
Contributor Author

Fixes #463

@massie
Copy link
Member

massie commented Nov 12, 2014

Looks like some binary files e.g. adam-core/src/test/resources/parquet_test/readstest1-0-2-0 are in this commit? Is that intentional?

@tdanford
Copy link
Contributor Author

Yeah, we want to make sure we can load Parquet files directly -- and we're testing the childLocators() method of ClasspathFileLocator at the same time.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/390/

Build result: FAILURE

GitHub pull request #472 of commit fd61a03 automatically merged.[EnvInject] - Loading node environment variables.Building remotely on amp-jenkins-slave-01 (centos) in workspace /home/jenkins/workspace/ADAM-prb > git rev-parse --is-inside-work-tree # timeout=10Fetching changes from the remote Git repository > git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > git --version # timeout=10 > git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ > git rev-parse origin/pr/472/merge^{commit} # timeout=10Checking out Revision 2a9921c (detached) > git config core.sparsecheckout # timeout=10 > git checkout -f 2a9921c > git rev-list a34074082d54668273cacb5528d8aaed1b9c6ea8 # timeout=10Triggering ADAM-prb » 2.3.0,centosTriggering ADAM-prb » 1.0.4,centosTriggering ADAM-prb » 2.2.0,centosADAM-prb » 2.3.0,centos completed with result SUCCESSADAM-prb » 1.0.4,centos completed with result FAILUREADAM-prb » 2.2.0,centos completed with result SUCCESS
Test FAILed.

@tdanford
Copy link
Contributor Author

GAH, HDFSFileLocator.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/391/

Build result: FAILURE

GitHub pull request #472 of commit db20564 automatically merged.[EnvInject] - Loading node environment variables.Building remotely on amp-jenkins-slave-01 (centos) in workspace /home/jenkins/workspace/ADAM-prb > git rev-parse --is-inside-work-tree # timeout=10Fetching changes from the remote Git repository > git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > git --version # timeout=10 > git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ > git rev-parse origin/pr/472/merge^{commit} # timeout=10Checking out Revision 4f75a4a (detached) > git config core.sparsecheckout # timeout=10 > git checkout -f 4f75a4a > git rev-list a34074082d54668273cacb5528d8aaed1b9c6ea8 # timeout=10Triggering ADAM-prb » 2.3.0,centosTriggering ADAM-prb » 1.0.4,centosTriggering ADAM-prb » 2.2.0,centosADAM-prb » 2.3.0,centos completed with result SUCCESSADAM-prb » 1.0.4,centos completed with result FAILUREADAM-prb » 2.2.0,centos completed with result FAILURE
Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/392/

Build result: FAILURE

GitHub pull request #472 of commit 3cc1afe automatically merged.[EnvInject] - Loading node environment variables.Building remotely on amp-jenkins-slave-01 (centos) in workspace /home/jenkins/workspace/ADAM-prb > git rev-parse --is-inside-work-tree # timeout=10Fetching changes from the remote Git repository > git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > git --version # timeout=10 > git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ > git rev-parse origin/pr/472/merge^{commit} # timeout=10Checking out Revision d4ff591 (detached) > git config core.sparsecheckout # timeout=10 > git checkout -f d4ff591 > git rev-list 4f75a4a # timeout=10Triggering ADAM-prb » 2.3.0,centosTriggering ADAM-prb » 1.0.4,centosTriggering ADAM-prb » 2.2.0,centosADAM-prb » 2.3.0,centos completed with result FAILUREADAM-prb » 1.0.4,centos completed with result FAILUREADAM-prb » 2.2.0,centos completed with result FAILURE
Test FAILed.

@fnothaft
Copy link
Member

I'd prefer to open this against bdg-utils

@@ -36,6 +36,7 @@ package org.bdgenomics.adam.io
trait FileLocator extends Serializable {

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like this file could use some Scaladoc love.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused -- doesn't FileLocator have a long scaladoc? Or do you mean method-by-method docs? I'll add those.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I was going for method-by-method; thanks for adding!

@tdanford
Copy link
Contributor Author

@fnothaft added some additional docs per your suggestion -- also added issues #476 and #477, and broke up the contents of the first commit to reflect changes that address specifically those issues. Just to be clear about what's being added here.

Agree that this should go into bdg-utils, but maybe we can get it working and committed here, and then go make the changes over there too.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/396/

Build result: FAILURE

GitHub pull request #472 of commit ed82e55 automatically merged.[EnvInject] - Loading node environment variables.Building remotely on amp-jenkins-slave-01 (centos) in workspace /home/jenkins/workspace/ADAM-prb > git rev-parse --is-inside-work-tree # timeout=10Fetching changes from the remote Git repository > git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > git --version # timeout=10 > git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ > git rev-parse origin/pr/472/merge^{commit} # timeout=10Checking out Revision cc683a5 (detached) > git config core.sparsecheckout # timeout=10 > git checkout -f cc683a5 > git rev-list d893364b0ad2ed754626d1fe1814062bde73bed3 # timeout=10Triggering ADAM-prb » 2.3.0,centosTriggering ADAM-prb » 1.0.4,centosTriggering ADAM-prb » 2.2.0,centosADAM-prb » 2.3.0,centos completed with result FAILUREADAM-prb » 1.0.4,centos completed with result FAILUREADAM-prb » 2.2.0,centos completed with result FAILURE
Test FAILed.

@tdanford
Copy link
Contributor Author

This appears to be a Jenkins error again.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/407/
Test PASSed.

@tdanford
Copy link
Contributor Author

Frank, I added some more docs, this has been rebased, and (thanks to Carl's fixes) now builds against all versions of Hadoop we test against.

Personally, I think this should be merged before your other PR (#478) which removes a lot of these methods from adam-core; if you merge this, then I'll file a PR that contains the same changes to the corresponding classes back in bdg-utils.

@tdanford tdanford mentioned this pull request Nov 16, 2014
@fnothaft
Copy link
Member

@tdanford OK, sure. Let's get this in, then move the code to bdg-utils. I'll review now.


override def bytes: ByteAccess = new HTTPRangedByteAccess(uri, retries)

/**
* Unsupported in HTTPFileLocator.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As an aside, what's the use case for HTTP access in the end? Not really germane to this PR, but should be documented eventually.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The HTTP access was added mainly for testing purposes.

@fnothaft
Copy link
Member

LGTM; the one thing I'd like added is a NetworkConnected test that tries to access a multi-Parquet file from S3. I'll put a file up for you in a short bit.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/426/
Test PASSed.

@tdanford
Copy link
Contributor Author

Frank, I'm not understanding this -- the tests pass on Jenkins, but not for me locally? What does it look like for you?

Updated the FileLocator trait to include a childLocators method, which
implements the ability to scan the children of a locator (in addition to
finding the parent).  This will be required for the AvroMultiParquetRDD
we're planning to introduce.  Unfortunately (see the comments) this
isn't possible to implement in HTTPFileLocator without additional
assumptions.

Also filled in the implementation of some missing methods in
HTTPFileLocator in this commit, and added comments to FileLocator
overall.
HDFSFileLocator implements the FileLocator API, but uses the Hadoop
interface to find files in an HDFS system.

Includes the InputStreamByteAccess implementation, which is used
internally.
AvroMultiParquetRDD takes a FileLocator which indicates the _parent_
directory (in HDFS, or S3) of a set of parquet files, and which loads
_all_ row groups from all the parquet files that are immediate children
of that FileLocator.

(Here, "children" is defined w/r/t the childLocators method
implementation in the corresponding FileLocator.)

(Joint work with Carl Yeksigian.)
@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/469/

Build result: FAILURE

GitHub pull request #472 of commit 8b0d214 automatically merged.[EnvInject] - Loading node environment variables.Building remotely on amp-jenkins-slave-01 (centos) in workspace /home/jenkins/workspace/ADAM-prb > git rev-parse --is-inside-work-tree # timeout=10Fetching changes from the remote Git repository > git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > git --version # timeout=10 > git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ > git rev-parse origin/pr/472/merge^{commit} # timeout=10Checking out Revision b1126ca (detached) > git config core.sparsecheckout # timeout=10 > git checkout -f b1126ca > git rev-list 7402be7 # timeout=10Triggering ADAM-prb » 2.3.0,centosTriggering ADAM-prb » 2.2.0,centosTriggering ADAM-prb » 1.0.4,centosADAM-prb » 2.3.0,centos completed with result FAILUREADAM-prb » 2.2.0,centos completed with result FAILUREADAM-prb » 1.0.4,centos completed with result FAILURE
Test FAILed.

@fnothaft
Copy link
Member

fnothaft commented Dec 9, 2014

Jenkins, retest this please.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/485/

Build result: FAILURE

GitHub pull request #472 of commit 8b0d214 automatically merged.[EnvInject] - Loading node environment variables.Building remotely on amp-jenkins-slave-01 (centos) in workspace /home/jenkins/workspace/ADAM-prb > git rev-parse --is-inside-work-tree # timeout=10Fetching changes from the remote Git repository > git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > git --version # timeout=10 > git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ > git rev-parse origin/pr/472/merge^{commit} # timeout=10Checking out Revision d2b6d8f (detached) > git config core.sparsecheckout # timeout=10 > git checkout -f d2b6d8f > git rev-list e85aea7 # timeout=10Triggering ADAM-prb » 2.3.0,centosTriggering ADAM-prb » 2.2.0,centosTriggering ADAM-prb » 1.0.4,centosADAM-prb » 2.3.0,centos completed with result FAILUREADAM-prb » 2.2.0,centos completed with result FAILUREADAM-prb » 1.0.4,centos completed with result FAILURE
Test FAILed.

@fnothaft
Copy link
Member

fnothaft commented Feb 5, 2015

Closing this as after #478 this will need to be moved to the bdg-utils repo.

@fnothaft fnothaft closed this Feb 5, 2015
@fnothaft fnothaft deleted the multi-loader branch August 13, 2015 02:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants