
Parquet-related Utility Classes #264

Merged

Conversation

tdanford
Contributor

This includes several utility classes that we wrote, as part of our work to adapt and re-use Parquet within Spark and ADAM.

One of our goals was to re-use as little Hadoop-specific functionality as possible, since disentangling Parquet from Hadoop was important (for a number of reasons, not least of which are class conflicts when it comes to wrapping HTTP services around a Spark cluster).

This PR includes three major components:

  • ByteAccess -- a trait whose implementations wrap any source of bytes to which random access is available,
  • some classes that help us load and manage S3 and AWS credentials, and
  • FileLocator -- a trait whose implementations abstract over Hadoop's "Path" class, Java's File class, URLs, and pointers into S3.
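A rough sketch of the shape of these abstractions (the actual ADAM code defines Scala traits; the names and signatures below are illustrative guesses, rendered here as Java interfaces, not the real API):

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;

// Illustrative sketch only: the real code is Scala traits; these names and
// signatures are guesses for exposition, not the actual ADAM API.
public class ByteAccessSketch {

    // A source of bytes to which random access is available.
    interface ByteAccess {
        long length();
        InputStream readByteStream(long offset, int length);
    }

    // A trivial in-memory implementation, handy for tests.
    static class InMemoryByteAccess implements ByteAccess {
        private final byte[] bytes;
        InMemoryByteAccess(byte[] bytes) { this.bytes = bytes; }
        public long length() { return bytes.length; }
        public InputStream readByteStream(long offset, int length) {
            return new ByteArrayInputStream(bytes, (int) offset, length);
        }
    }

    // Names the location of a resource (local file, classpath entry,
    // S3 bucket/key, URL) and opens it for random access.
    interface FileLocator {
        FileLocator relativeLocator(String relativePath);
        ByteAccess bytes();
    }

    public static void main(String[] args) throws Exception {
        ByteAccess access = new InMemoryByteAccess(new byte[] { 10, 20, 30, 40 });
        InputStream in = access.readByteStream(2, 2); // random access at offset 2
        System.out.println(access.length() + " " + in.read()); // prints "4 30"
    }
}
```

The point of the split is that anything indexable (a local file, an S3 object, a test byte array) can sit behind ByteAccess, while FileLocator handles only naming and resolution.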

Carl and I are eager to hear, though, if anyone objects to these implementations and would like to suggest an alternative class or type that already exists and does the same thing. We're eager not to re-invent too many wheels.

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/365/

@tdanford
Contributor Author

That's an interesting failure, "cannot find 'git'."

@@ -147,5 +147,10 @@
<artifactId>scalatest_${scala.artifact.suffix}</artifactId>
<scope>test</scope>
</dependency>

Member

Tighten up this space ;).

@fnothaft
Member

Jenkins, retest this please.

@fnothaft
Member

@tdanford @carlyeks this code looks good to me, but it's somewhat sparsely documented (see my comments above).

I'm concerned that ADAM isn't the best place to put this code. Perhaps it'd be best to create some sort of S3Utils/ParquetUtils/FileUtils project in bdgenomics that we put these in? My inkling is that we may want to use this code later outside of ADAM (and other people may want to use this code), and we wouldn't want to pull all of ADAM in to access just these classes. If you folks are interested in this, I can set up a repository and work to get mvn deployment and Jenkins set up.

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/366/

@tdanford
Contributor Author

Frank, we (Carl and I) are fine doing it in a different repository, if you want to set one up. However, we didn't think that this would ultimately go into ADAM, but rather, this is the first PR (of several) which supports our internal re-implementation of Parquet's metadata data structures and some indexing on top of Parquet files.

I think Carl and I were expecting that this code would ultimately be submitted to Parquet, but we were assuming that we could use ADAM as a way-station as part of our effort to quickly get it out of our private-only internal repos.

@massie thoughts?

@fnothaft
Member

@tdanford ah, I see. That should be OK then.

@tdanford
Contributor Author

The ordering here, by the way, will be that once (1) this PR is merged, and (2) the flat genotype PR is merged, there will be either one or two more PRs which contain Parquet-specific code: mostly re-implementing a bunch of the internal Parquet data structures in Scala, then using them to build an RDD on top of Parquet files directly, and finally adding some code to index Parquet files by genomic region and sample and building an RDD using those indices.

The parquet classes themselves, along with the support classes (in this PR), are the ones that will go into Parquet ultimately I hope.

@fnothaft
Member

Indexing support sounds pretty cool! I know @massie is very interested in these. BTW, @massie is working on fixing Jenkins. There's some issue where the VMs we use to run Jenkins went down... hopefully this will be resolved by the end of today.

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/367/

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/368/

}

/**
* This is somewhat poorly named, it probably should be LocalFileByteAccess
Member

Isn't that what it's named?

Member

Jedi mind tricks...

Contributor Author

I forgot to remove that comment, when I renamed it in the most recent rebase.

Contributor Author

Fixed now.

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/370/

@massie
Member

massie commented Jun 14, 2014

Jenkins, test this please.

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/371/

@massie
Member

massie commented Jun 14, 2014

Jenkins, test this please.

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/372/

@massie
Member

massie commented Jun 16, 2014

Jenkins, test this please.

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/378/

@fnothaft
Member

Jenkins, retest this please.

Looks like a network address bind error...

@carlyeks
Member

Jenkins recovers from those network bind errors, but the jar file is invalid; it is failing the integration test in the very last few lines of the output. @tdanford suggested that we are probably not shading the jar file correctly (seems plausible), so we will have to fix this.

Tests save the day!!

@tdanford
Contributor Author

Ugh. Yeah.

@fnothaft
Member

Ah, fascinating! Does this show up on your end as well, or just in Jenkins?

@tdanford
Contributor Author

It does show up on our end. It is very confusing (I was saying this in IRC too); here's the rundown as we have it so far.

The current master builds a fine jar file, but (if you look in our repo), commit 37c7060 produces a build which fails. All that commit does is include the AWS SDK.

Okay, so, looking at the "corrupt" jar a little more closely: running
java -jar adam-0.11.1-SNAPSHOT.jar
fails. But running
java -cp adam-0.11.1-SNAPSHOT.jar org.bdgenomics.adam.cli.ADAMMain

it works fine! Weird. So we figured it was the MANIFEST.MF files, but ... they are identical between jars. So now we're at the point where we're diffing pieces of the unpacked jars and trying to guess what's going wrong.

Anyone else have any ideas?

@carlyeks
Member

I think we have discovered the source of the issue: the classic ZIP format doesn't support as many entries as we now have in our uber jar. The limit is 65,535 entries, and with the addition of the AWS SDK dependency we have over 74,000.
One solution is to use minimizeJar, which keeps only the classes that are actually referenced in the jar file; however, if a class is referenced by name only (as in the log4j dependencies), it will not be included in the jar. Another potential solution is getting Maven to use ZIP64 (which is supposedly supported by Java 7) -- we haven't figured out how to do this one yet.
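One quick way to check a jar against that 65,535-entry limit is to count its entries with the JDK's java.util.zip (a diagnostic sketch, assuming only the standard library):

```java
import java.util.zip.ZipFile;

// Diagnostic sketch: count the entries in a jar/zip archive and compare
// against the classic (pre-ZIP64) limit of 65,535 entries.
public class JarEntryCount {
    static final int CLASSIC_ZIP_ENTRY_LIMIT = 65535;

    static int countEntries(String path) throws Exception {
        try (ZipFile zip = new ZipFile(path)) {
            return zip.size(); // number of entries in the central directory
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.out.println("usage: JarEntryCount <jar-path>");
            return;
        }
        int n = countEntries(args[0]); // e.g. the path to the uber jar
        System.out.println(n + " entries"
                + (n > CLASSIC_ZIP_ENTRY_LIMIT ? " -- over the classic ZIP limit, needs ZIP64" : ""));
    }
}
```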

@fnothaft
Member

Nice! I did some grepping, and it looks like Maven's assembly plugin should support ZIP64. It doesn't seem to be documented well, but here's the general "chain" of resources I found pointing to this:

I can take a look at this in the PM, if necessary.

@carlyeks
Member

Since the Amazon AWS SDK is 8.8k files by itself (...), I changed that dependency to 'provided' scope, so that at least it compiles again.
The problem with minimizeJar is that the log4j and Hadoop dependencies are all referenced by string name (and would thus be dropped), so it seems like a can of worms trying to get that to work.
I think the easiest longer-term fix is to ZIP64 the uber jar.

@tdanford
Contributor Author

Carl and I actually disagree on this -- I am leaning towards not creating an uber-jar at all, as part of any (truly) long-term fix.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/384/

@carlyeks
Member

Now we're cooking with gas.

@tdanford
Contributor Author

Should we squash these commits down too? (By the way, I'm still confused as to why the 'java -cp' approach, outlined above, worked when 'java -jar' didn't.)

@carlyeks
Member

Luck of the jar, my friend.

Probably warrants some investigation...

@tdanford
Contributor Author

I've rebased this PR on top of master and down into the (original) four commits, which are each somewhat standalone but do actually build off each other.

I can rebase down into a single commit though, if anyone cares.

@tdanford
Contributor Author

"Luck of the jar, my friend."

-- I think you mean, "jarnanigans."

@carlyeks
Member

"jarish"?

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/385/

@fnothaft
Member

I have a general preference towards the commits being squashed down; otherwise, looks good!

The following changes were made as part of this commit:

* Added a dependency on the AWS Java SDK.

This dependency is listed as 'provided', because the AWS SDK for Java adds 8k+ classes,
which pushes our uber-jar over the standard 65,535-entry limit of the ZIP format.

* Added ByteAccess trait and implementations.

ByteAccess is a trait representing "sources of bytes" for which random access is available,
e.g. local files, S3, etc. It will be used to wrap different data storage back-ends
for raw byte access, allowing us to bypass the Hadoop library itself.

* Added SerializableAWSCredentials, an AWSCredentials implementation which is Serializable.

We're going to need to pass around (serializable) credentials for S3, in order to
implement our ParquetRDD that reads directly from S3.

* Added CredentialsProperties, which encapsulates reading AWS credentials from configuration.

This is a wrapper class, designed to allow us to recognize and retrieve credentials from
the environment, config files, or other sources.

* Added FileLocator and implementations.

FileLocator, and its implementations, are a common wrapper around local Files, classpath locations,
bucket/keyname pairs in S3, and static byte arrays (for testing). This is a (serializable) way
of naming the location of Parquet files and indices across different systems.
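The SerializableAWSCredentials idea described in the commit message can be sketched like this (an illustrative guess, not the actual ADAM code; AwsCredentialsLike below is a local stand-in mirroring the getter names of the AWS SDK's AWSCredentials interface, which the real class would implement directly):

```java
import java.io.*;

// Illustrative sketch: a credentials holder that is both a credentials
// object and Serializable, so it can be shipped to Spark workers.
public class SerializableCredentialsSketch {

    // Local stand-in for com.amazonaws.auth.AWSCredentials (same getters).
    interface AwsCredentialsLike {
        String getAWSAccessKeyId();
        String getAWSSecretKey();
    }

    static class SerializableAWSCredentials implements AwsCredentialsLike, Serializable {
        private final String accessKey;
        private final String secretKey;
        SerializableAWSCredentials(String accessKey, String secretKey) {
            this.accessKey = accessKey;
            this.secretKey = secretKey;
        }
        public String getAWSAccessKeyId() { return accessKey; }
        public String getAWSSecretKey() { return secretKey; }
    }

    // Round-trip through Java serialization, as Spark would when sending
    // a closure containing the credentials to executors.
    static SerializableAWSCredentials roundTrip(SerializableAWSCredentials c) throws Exception {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(c);
        }
        try (ObjectInputStream in =
                new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            return (SerializableAWSCredentials) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        SerializableAWSCredentials c =
            roundTrip(new SerializableAWSCredentials("example-access-key", "example-secret"));
        System.out.println(c.getAWSAccessKeyId()); // prints "example-access-key"
    }
}
```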
@tdanford
Contributor Author

Squashed!

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/386/

fnothaft added a commit that referenced this pull request Jun 17, 2014
@fnothaft fnothaft merged commit 86ed049 into bigdatagenomics:master Jun 17, 2014
@fnothaft
Member

Merged! Thanks @carlyeks and @tdanford!

@tdanford tdanford deleted the parquet-utils-rebase branch June 17, 2014 18:42
@tdanford
Contributor Author

Awesome. Thanks Frank!
