refactor RDD loading; explicitly load alignments #468

Merged: 1 commit into bigdatagenomics:master on Nov 14, 2014

Conversation

ryan-williams
Member

We never really load a generic Parquet file of SpecificRecords without
having a strong requirement on the record type; on the other hand, there
is a case where we know what type we want (AlignmentRecord) but don’t
know which code path should read it (sam, bam, ifq, fq, parquet, …).

Here the caller says explicitly that it wants AlignmentRecords
(previously it did so only by annotating the return value), while still
benefiting from the smarts around file-extension inference.

Parquet reads with a known type still go through a single code path that
relies only on SpecificRecord, so that code does not need to be
duplicated.
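
To illustrate the idea of "known record type, inferred file format", here is a minimal sketch. All names in it (`AlignmentFormat`, `AlignmentLoading.inferFormat`) are hypothetical and are not ADAM's actual API. The extension list follows the one in the description above, and the fallback to Parquet for any other extension is an assumption.

```scala
// Hypothetical sketch: these names are illustrative, not ADAM's actual API.
sealed trait AlignmentFormat
case object SamOrBam extends AlignmentFormat
case object InterleavedFastq extends AlignmentFormat
case object UnpairedFastq extends AlignmentFormat
case object ParquetAlignments extends AlignmentFormat

object AlignmentLoading {
  // The caller has already committed to wanting alignments; only the
  // on-disk format is inferred from the file extension.
  def inferFormat(path: String): AlignmentFormat = {
    val p = path.toLowerCase
    if (p.endsWith(".sam") || p.endsWith(".bam")) SamOrBam
    else if (p.endsWith(".ifq")) InterleavedFastq
    else if (p.endsWith(".fq")) UnpairedFastq
    else ParquetAlignments // assumption: anything else is treated as Parquet
  }
}
```

With this shape, a call such as `AlignmentLoading.inferFormat("reads.sam")` yields `SamOrBam`, so a SAM file is never handed to a Parquet reader, while a generic Parquet loader can stay separate and be parameterized only on the record type.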

@fnothaft
Member

fnothaft commented Nov 7, 2014

+1, I think this is a good idea. We should extend it to variants as well, at a later time.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/382/

Build result: FAILURE

GitHub pull request #468 of commit b9b4718 automatically merged.
[EnvInject] - Loading node environment variables.
Building remotely on amp-jenkins-slave-01 (centos) in workspace /home/jenkins/workspace/ADAM-prb
 > git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10
Fetching upstream changes from https://github.com/bigdatagenomics/adam.git
 > git --version # timeout=10
 > git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/
 > git rev-parse origin/pr/468/merge^{commit} # timeout=10
Checking out Revision 81f2b1f (detached)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 81f2b1f
 > git rev-list a34074082d54668273cacb5528d8aaed1b9c6ea8 # timeout=10
Triggering ADAM-prb » 1.0.4,centos
Triggering ADAM-prb » 2.2.0,centos
Triggering ADAM-prb » 2.3.0,centos
ADAM-prb » 1.0.4,centos completed with result FAILURE
ADAM-prb » 2.2.0,centos completed with result FAILURE
ADAM-prb » 2.3.0,centos completed with result FAILURE
Test FAILed.

@massie
Member

massie commented Nov 7, 2014

Jenkins retest this.

@ryan-williams
Member Author

@massie do you know why Jenkins failed? or why the retest? (I thought tests were passing for me locally; trying again now as well)

@fnothaft
Member

fnothaft commented Nov 7, 2014

@ryan-williams it looks like the SAM loading tests are failing:

Error Message

Could not read footer: java.lang.RuntimeException: file:/tmp/reads124698313342189278134.sam is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [58, 55, 53, 10]
Stacktrace

      java.io.IOException: Could not read footer: java.lang.RuntimeException: file:/tmp/reads124698313342189278134.sam is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [58, 55, 53, 10]
      at parquet.hadoop.ParquetFileReader.readAllFootersInParallel(ParquetFileReader.java:190)
      at parquet.hadoop.ParquetFileReader.readAllFootersInParallelUsingSummaryFiles(ParquetFileReader.java:146)
      at parquet.hadoop.ParquetInputFormat.getFooters(ParquetInputFormat.java:458)
      at parquet.hadoop.ParquetInputFormat.getFooters(ParquetInputFormat.java:443)
      at parquet.hadoop.ParquetInputFormat.getSplits(ParquetInputFormat.java:344)
      at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:94)
      at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
      at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
      at scala.Option.getOrElse(Option.scala:120)
      at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
      at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
      at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
      at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
      at scala.Option.getOrElse(Option.scala:120)
      at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
      at org.apache.spark.rdd.RDD.take(RDD.scala:1060)
      at org.bdgenomics.adam.plugins.Take10Plugin.run(Take10Plugin.scala:30)
      at org.bdgenomics.adam.cli.PluginExecutor.run(PluginExecutor.scala:121)
      at 
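
The byte values in the error message above are easier to read once decoded as ASCII. The following is a small sketch using only the Scala and Java standard libraries; the object and method names (`ParquetMagic`, `asAscii`, `hasParquetTail`) are made up for illustration and are not part of Parquet or ADAM. It shows that the expected tail is the Parquet magic `PAR1`, while the bytes actually found are ordinary text ending in a newline, which is consistent with a SAM text file being handed to the Parquet reader.

```scala
import java.nio.charset.StandardCharsets

// Hypothetical helper names; not part of Parquet or ADAM.
object ParquetMagic {
  // Parquet files end with the 4-byte ASCII magic "PAR1".
  val Magic: Seq[Byte] = "PAR1".getBytes(StandardCharsets.US_ASCII).toSeq

  // Decode a few raw bytes as ASCII so they can be read by eye.
  def asAscii(bytes: Seq[Byte]): String =
    new String(bytes.toArray, StandardCharsets.US_ASCII)

  // True when the last four bytes of `contents` are the Parquet magic.
  def hasParquetTail(contents: Seq[Byte]): Boolean =
    contents.takeRight(Magic.length) == Magic
}

object ParquetMagicDemo {
  def main(args: Array[String]): Unit = {
    val expected = Seq[Byte](80, 65, 82, 49) // "expected" bytes from the error message
    val found = Seq[Byte](58, 55, 53, 10)    // "found" bytes from the error message
    println(ParquetMagic.asAscii(expected))     // PAR1
    println(ParquetMagic.asAscii(found).trim)   // :75
    println(ParquetMagic.hasParquetTail(found)) // false
  }
}
```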

@ryan-williams
Member Author

yea seems real, let me look, I swear they were passing very shortly before I pushed

@ryan-williams
Member Author

cool, that should have fixed it, just needed to bring PluginExecutor into the future

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/383/
Test PASSed.

@ryan-williams
Member Author

bump! this good to go?

@fnothaft
Member

Ah, yes! Thanks for the ping. Can you rebase? I'll merge once rebased.

@ryan-williams
Member Author

I believe it is now rebased

fnothaft added a commit that referenced this pull request Nov 14, 2014
refactor RDD loading; explicitly load alignments
@fnothaft fnothaft merged commit a2673e1 into bigdatagenomics:master Nov 14, 2014
@fnothaft
Member

Merged! Thanks @ryan-williams!

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/399/
Test PASSed.
