
Parquet-related Utility Classes #264

Merged

Conversation

tdanford
Contributor

This includes several utility classes that we wrote, as part of our work to adapt and re-use Parquet within Spark and ADAM.

One of our goals was to re-use as little Hadoop-specific functionality as possible, since disentangling Parquet from Hadoop was important (for a number of reasons, not least of which are class conflicts when it comes to wrapping HTTP services around a Spark cluster).

This PR includes three major components:

  • ByteAccess -- a trait whose implementations wrap any source of bytes to which random access is available,
  • some classes that help us load and manage S3 and AWS credentials, and
  • FileLocator -- a trait whose implementations abstract over Hadoop's "Path" class, Java's File class, URLs, and pointers into S3.
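A rough sketch of the shape of these abstractions (the actual ADAM code defines Scala traits; the names and signatures below are illustrative guesses, rendered here as Java interfaces, not the real API):

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;

// Illustrative sketch only: the real code is Scala traits; these names and
// signatures are guesses for exposition, not the actual ADAM API.
public class ByteAccessSketch {

    // A source of bytes to which random access is available.
    interface ByteAccess {
        long length();
        InputStream readByteStream(long offset, int length);
    }

    // A trivial in-memory implementation, handy for tests.
    static class InMemoryByteAccess implements ByteAccess {
        private final byte[] bytes;
        InMemoryByteAccess(byte[] bytes) { this.bytes = bytes; }
        public long length() { return bytes.length; }
        public InputStream readByteStream(long offset, int length) {
            return new ByteArrayInputStream(bytes, (int) offset, length);
        }
    }

    // Names the location of a resource (local file, classpath entry,
    // S3 bucket/key, URL) and opens it for random access.
    interface FileLocator {
        FileLocator relativeLocator(String relativePath);
        ByteAccess bytes();
    }

    public static void main(String[] args) throws Exception {
        ByteAccess access = new InMemoryByteAccess(new byte[] { 10, 20, 30, 40 });
        InputStream in = access.readByteStream(2, 2); // random access at offset 2
        System.out.println(access.length() + " " + in.read()); // prints "4 30"
    }
}
```

The point of the split is that anything indexable (a local file, an S3 object, a test byte array) can sit behind ByteAccess, while FileLocator handles only naming and resolution.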

Carl and I are eager to hear, though, if anyone objects to these implementations and would like to suggest an alternative class or type that already exists and does the same thing. We're eager not to re-invent too many wheels.

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/365/

@tdanford
Contributor Author

That's an interesting failure, "cannot find 'git'."

@@ -147,5 +147,10 @@
<artifactId>scalatest_${scala.artifact.suffix}</artifactId>
<scope>test</scope>
</dependency>

Member

Tighten up this space ;).

@fnothaft
Member

Jenkins, retest this please.

@fnothaft
Member

@tdanford @carlyeks this code looks good to me, but it's somewhat sparsely documented (see my comments above).

I'm concerned that ADAM isn't the best place to put this code. Perhaps it'd be best to create some sort of S3Utils/ParquetUtils/FileUtils project in bdgenomics that we put these in? My inkling is that we may want to use this code later outside of ADAM (and other people may want to use this code), and we wouldn't want to pull all of ADAM in to access just these classes. If you folks are interested in this, I can set up a repository and work to get mvn deployment and Jenkins set up.

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/366/

@tdanford
Contributor Author

Frank, we (Carl and I) are fine doing it in a different repository, if you want to set one up. However, we didn't think that this would ultimately go into ADAM, but rather, this is the first PR (of several) which supports our internal re-implementation of Parquet's metadata data structures and some indexing on top of Parquet files.

I think Carl and I were expecting that this code would ultimately be submitted to Parquet, but we were assuming that we could use ADAM as a way-station as part of our effort to quickly get it out of our private-only internal repos.

@massie thoughts?

@fnothaft
Member

@tdanford ah, I see. That should be OK then.

@tdanford
Contributor Author

The ordering here, by the way, will be that once (1) this PR is merged, and (2) the flat genotype PR is merged, there will be either one or two more PRs which contain Parquet-specific code: mostly re-implementing a bunch of the internal Parquet data structures in Scala, then using them to build an RDD on top of Parquet files directly, and finally adding some code to index Parquet files by genomic region and sample and building an RDD using those indices.

The parquet classes themselves, along with the support classes (in this PR), are the ones that will go into Parquet ultimately I hope.

@fnothaft
Member

Indexing support sounds pretty cool! I know @massie is very interested in these. BTW, @massie is working on fixing Jenkins. There's some issue where the VMs we use to run Jenkins went down... hopefully this will be resolved by the end of today.

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/367/

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/368/

}

/**
* This is somewhat poorly named, it probably should be LocalFileByteAccess
Member

Isn't that what it's named?

Member

Jedi mind tricks...

Contributor Author

I forgot to remove that comment, when I renamed it in the most recent rebase.

Contributor Author

Fixed now.

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/370/

@massie
Member

massie commented Jun 14, 2014

Jenkins, test this please.

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/371/

@massie
Member

massie commented Jun 14, 2014

Jenkins, test this please.

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/372/

@massie
Member

massie commented Jun 16, 2014

Jenkins, test this please.

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/378/

@fnothaft
Member

Jenkins, retest this please.

Looks like a network address bind error...

@carlyeks
Member

Jenkins recovers from those network bind errors, but the jar file is invalid; it is failing the integration test in the very last few lines of the output. @tdanford suggested that we are probably not shading the jar file correctly (seems plausible), so we will have to fix this.

Tests save the day!!

@tdanford
Contributor Author

Ugh. Yeah.

@fnothaft
Member

Ah, fascinating! Does this show up on your end as well, or just in Jenkins?

@tdanford
Contributor Author

It does show up on our end. It is very confusing (I was saying this in IRC too); here's the rundown as we have it so far.

The current master builds a fine jar file, but (if you look in our repo), commit 37c7060 produces a build which fails. All that commit does is include the AWS SDK.

Okay, so, looking at the "corrupt" jar a little more closely: running
java -jar adam-0.11.1-SNAPSHOT.jar
fails. But running
java -cp adam-0.11.1-SNAPSHOT.jar org.bdgenomics.adam.cli.ADAMMain

it works fine! Weird. So we figured it was the MANIFEST.MF files, but ... they are identical between jars. So now we're at the point where we're diffing pieces of the unpacked jars and trying to guess what's going wrong.

Anyone else have any ideas?

@carlyeks
Member

I think we have discovered the source of the issue: the classic ZIP format doesn't support as many entries as we now have in our uber jar. The limit is 65,535 entries, and with the addition of the AWS SDK dependency we have over 74,000.
One solution is to use minimizeJar, which keeps only the classes that are actually referenced in the jar file; however, if a class is referenced by name only (as in the log4j dependencies), it will not be included in the jar. Another potential solution is getting Maven to use ZIP64 (which is supposedly supported by Java 7) -- we haven't figured out how to do this one yet.
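One quick way to check a jar against that 65,535-entry limit is to count its entries with the JDK's java.util.zip (a diagnostic sketch, assuming only the standard library):

```java
import java.util.zip.ZipFile;

// Diagnostic sketch: count the entries in a jar/zip archive and compare
// against the classic (pre-ZIP64) limit of 65,535 entries.
public class JarEntryCount {
    static final int CLASSIC_ZIP_ENTRY_LIMIT = 65535;

    static int countEntries(String path) throws Exception {
        try (ZipFile zip = new ZipFile(path)) {
            return zip.size(); // number of entries in the central directory
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.out.println("usage: JarEntryCount <jar-path>");
            return;
        }
        int n = countEntries(args[0]); // e.g. the path to the uber jar
        System.out.println(n + " entries"
                + (n > CLASSIC_ZIP_ENTRY_LIMIT ? " -- over the classic ZIP limit, needs ZIP64" : ""));
    }
}
```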

@fnothaft
Member

Nice! I did some grepping, and it looks like Maven's assembly plugin should support ZIP64. It doesn't seem to be documented well, but here's the general "chain" of resources I found pointing to this:

I can take a look at this in the PM, if necessary.

@carlyeks
Member

Since the Amazon AWS SDK is 8.8k files by itself (...), I changed that dependency to 'provided' scope, so that at least it compiles again.
The problem with minimizeJar is that the log4j and Hadoop dependencies are all referenced by string name (and would thus be dropped), so it seems like a can of worms trying to get that to work.
I think the easiest longer-term fix is to ZIP64 the uber jar.

@tdanford
Contributor Author

Carl and I actually disagree on this -- I am leaning towards not creating an uber-jar at all, as part of any (truly) long-term fix.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/384/

@carlyeks
Member

Now we're cooking with gas.

@tdanford
Contributor Author

Should we squash these commits down too? (By the way, I'm still confused as to why the 'java -cp' approach, outlined above, worked when 'java -jar' didn't.)

@carlyeks
Member

Luck of the jar, my friend.

Probably warrants some investigation...

@tdanford
Contributor Author

I've rebased this PR on top of master and down into the (original) four commits, which are each somewhat standalone but do actually build off each other.

I can rebase down into a single commit though, if anyone cares.

@tdanford
Contributor Author

"Luck of the jar, my friend."

-- I think you mean, "jarnanigans."

@carlyeks
Member

"jarish"?

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/385/

@fnothaft
Member

I have a general preference towards the commits being squashed down; otherwise, looks good!

The following changes were made as part of this commit:

* Added a dependency on the AWS Java SDK.

This dependency is listed as 'provided', because the AWS SDK for Java adds 8k+ classes,
which pushes our uber-jar over the standard 65,535-entry limit of the ZIP format.

* Added ByteAccess trait and implementations.

ByteAccess is a trait representing "sources of bytes" for which random access is available,
e.g. local files, S3, etc. It will be used to wrap different data storage back-ends
for raw byte access, allowing us to bypass the Hadoop library itself.

* Added SerializableAWSCredentials, an AWSCredentials implementation which is Serializable.

We're going to need to pass around (serializable) credentials for S3, in order to
implement our ParquetRDD that reads directly from S3.

* Added CredentialsProperties, which encapsulates reading AWS credentials from configuration.

This is a wrapper class, designed to allow us to recognize and retrieve credentials from
the environment, config files, or other sources.

* Added FileLocator and implementations.

FileLocator, and its implementations, are a common wrapper around local Files, classpath locations,
bucket/keyname pairs in S3, and static byte arrays (for testing). This is a (serializable) way
of naming the location of Parquet files and indices across different systems.
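The SerializableAWSCredentials idea described in the commit message can be sketched like this (an illustrative guess, not the actual ADAM code; AwsCredentialsLike below is a local stand-in mirroring the getter names of the AWS SDK's AWSCredentials interface, which the real class would implement directly):

```java
import java.io.*;

// Illustrative sketch: a credentials holder that is both a credentials
// object and Serializable, so it can be shipped to Spark workers.
public class SerializableCredentialsSketch {

    // Local stand-in for com.amazonaws.auth.AWSCredentials (same getters).
    interface AwsCredentialsLike {
        String getAWSAccessKeyId();
        String getAWSSecretKey();
    }

    static class SerializableAWSCredentials implements AwsCredentialsLike, Serializable {
        private final String accessKey;
        private final String secretKey;
        SerializableAWSCredentials(String accessKey, String secretKey) {
            this.accessKey = accessKey;
            this.secretKey = secretKey;
        }
        public String getAWSAccessKeyId() { return accessKey; }
        public String getAWSSecretKey() { return secretKey; }
    }

    // Round-trip through Java serialization, as Spark would when sending
    // a closure containing the credentials to executors.
    static SerializableAWSCredentials roundTrip(SerializableAWSCredentials c) throws Exception {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(c);
        }
        try (ObjectInputStream in =
                new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            return (SerializableAWSCredentials) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        SerializableAWSCredentials c =
            roundTrip(new SerializableAWSCredentials("example-access-key", "example-secret"));
        System.out.println(c.getAWSAccessKeyId()); // prints "example-access-key"
    }
}
```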
@tdanford
Contributor Author

Squashed!

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/386/

fnothaft added a commit that referenced this pull request Jun 17, 2014
@fnothaft fnothaft merged commit 86ed049 into bigdatagenomics:master Jun 17, 2014
@fnothaft
Member

Merged! Thanks @carlyeks and @tdanford!

@tdanford tdanford deleted the parquet-utils-rebase branch June 17, 2014 18:42
@tdanford
Contributor Author

Awesome. Thanks Frank!
