refactor RDD loading; explicitly load alignments #468

ryan-williams · 2014-11-07T18:59:13Z

There’s really no case where we load a generic Parquet file of
SpecificRecords without a strong requirement on their type. There is,
however, a case where we know which type we want (AlignmentRecord) but
not which code path should read it (sam, bam, ifq, fq, parquet, …).

Here I’ve made the caller state explicitly that it wants
AlignmentRecords (which it was previously doing by labeling the return
value) while still benefiting from the smarts around file-extension
inference.

Parquet reads with a known type can still go through one place, so that
code does not need to be duplicated; it relies only on SpecificRecord.
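
As a rough illustration of the dispatch idea described above (not ADAM's actual code, which is Scala), the caller asks for alignments and the loader picks a reader from the file extension. The names `READERS` and `reader_for` and the reader labels are hypothetical, and falling back to Parquet for any unrecognised extension is an assumption made for this sketch.

```python
import os

# Hypothetical mapping from file extension to the reader that would handle it.
# The extensions come from the description above; the reader labels are made up.
READERS = {
    ".sam": "sam/bam reader",
    ".bam": "sam/bam reader",
    ".ifq": "interleaved fastq reader",
    ".fq": "fastq reader",
}


def reader_for(path: str) -> str:
    """Pick a reader for alignment data based on the file extension.

    Anything without a recognised extension falls through to the single
    shared Parquet path (an assumption of this sketch).
    """
    _, ext = os.path.splitext(path.lower())
    return READERS.get(ext, "parquet reader")
```

With this shape, a `.sam` path never reaches the Parquet reader, because the extension check happens before any format-specific code runs.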

fnothaft · 2014-11-07T19:06:12Z

+1, I think this is a good idea. We should extend it to variants as well, at a later time.

AmplabJenkins · 2014-11-07T19:09:59Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/382/

Build result: FAILURE

GitHub pull request #468 of commit b9b4718 automatically merged.
[EnvInject] - Loading node environment variables.
Building remotely on amp-jenkins-slave-01 (centos) in workspace /home/jenkins/workspace/ADAM-prb
 > git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10
Fetching upstream changes from https://github.com/bigdatagenomics/adam.git
 > git --version # timeout=10
 > git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/
 > git rev-parse origin/pr/468/merge^{commit} # timeout=10
Checking out Revision 81f2b1f (detached)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 81f2b1f
 > git rev-list a34074082d54668273cacb5528d8aaed1b9c6ea8 # timeout=10
Triggering ADAM-prb » 1.0.4,centos
Triggering ADAM-prb » 2.2.0,centos
Triggering ADAM-prb » 2.3.0,centos
ADAM-prb » 1.0.4,centos completed with result FAILURE
ADAM-prb » 2.2.0,centos completed with result FAILURE
ADAM-prb » 2.3.0,centos completed with result FAILURE
Test FAILed.

massie · 2014-11-07T19:11:12Z

Jenkins retest this.

ryan-williams · 2014-11-07T19:31:42Z

@massie do you know why Jenkins failed? or why the retest? (I thought tests were passing for me locally; trying again now as well)

fnothaft · 2014-11-07T19:33:26Z

@ryan-williams it looks like the SAM loading tests are failing:

Error Message

Could not read footer: java.lang.RuntimeException: file:/tmp/reads124698313342189278134.sam is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [58, 55, 53, 10]
Stacktrace

      java.io.IOException: Could not read footer: java.lang.RuntimeException: file:/tmp/reads124698313342189278134.sam is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [58, 55, 53, 10]
      at parquet.hadoop.ParquetFileReader.readAllFootersInParallel(ParquetFileReader.java:190)
      at parquet.hadoop.ParquetFileReader.readAllFootersInParallelUsingSummaryFiles(ParquetFileReader.java:146)
      at parquet.hadoop.ParquetInputFormat.getFooters(ParquetInputFormat.java:458)
      at parquet.hadoop.ParquetInputFormat.getFooters(ParquetInputFormat.java:443)
      at parquet.hadoop.ParquetInputFormat.getSplits(ParquetInputFormat.java:344)
      at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:94)
      at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
      at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
      at scala.Option.getOrElse(Option.scala:120)
      at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
      at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
      at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
      at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
      at scala.Option.getOrElse(Option.scala:120)
      at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
      at org.apache.spark.rdd.RDD.take(RDD.scala:1060)
      at org.bdgenomics.adam.plugins.Take10Plugin.run(Take10Plugin.scala:30)
      at org.bdgenomics.adam.cli.PluginExecutor.run(PluginExecutor.scala:121)
      at
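
The byte values in the message can be decoded to see what the reader tripped over. The sketch below is plain Python rather than anything from Parquet or ADAM, and `has_parquet_magic` is a hypothetical helper. It relies only on what the error itself states: the reader expects the four bytes [80, 65, 82, 49] at the tail of the file.

```python
import os

PARQUET_MAGIC = bytes([80, 65, 82, 49])  # the "expected" bytes from the error
FOUND = bytes([58, 55, 53, 10])          # the "found" bytes from the error

print(PARQUET_MAGIC.decode("ascii"))  # PAR1
print(repr(FOUND.decode("ascii")))    # ':75\n'


def has_parquet_magic(path: str) -> bool:
    """Return True if the last four bytes of the file are the Parquet magic."""
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        if f.tell() < len(PARQUET_MAGIC):
            return False  # too short to carry the magic at all
        f.seek(-len(PARQUET_MAGIC), os.SEEK_END)
        return f.read(len(PARQUET_MAGIC)) == PARQUET_MAGIC
```

The found bytes are plain text ending in a newline, which is consistent with a text SAM file having been handed to the Parquet reader.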

ryan-williams · 2014-11-07T19:43:18Z

yea seems real, let me look, I swear they were passing very shortly before I pushed

ryan-williams · 2014-11-07T21:08:48Z

cool, that should have fixed it, just needed to bring PluginExecutor into the future

AmplabJenkins · 2014-11-07T21:29:04Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/383/
Test PASSed.

ryan-williams · 2014-11-14T17:31:37Z

bump! this good to go?

fnothaft · 2014-11-14T17:41:39Z

Ah, yes! Thanks for the ping. Can you rebase? I'll merge once rebased.

ryan-williams · 2014-11-14T17:47:05Z

I believe it is now rebased

fnothaft · 2014-11-14T17:49:15Z

Merged! Thanks @ryan-williams!

AmplabJenkins · 2014-11-14T18:03:43Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/399/
Test PASSed.

ryan-williams force-pushed the loading branch from b9b4718 to 187acbd Compare November 7, 2014 21:08

ryan-williams force-pushed the loading branch from 187acbd to 66dd5e7 Compare November 14, 2014 17:46

fnothaft added a commit that referenced this pull request Nov 14, 2014

Merge pull request #468 from ryan-williams/loading

a2673e1

refactor RDD loading; explicitly load alignments

fnothaft merged commit a2673e1 into bigdatagenomics:master Nov 14, 2014

This was referenced Nov 15, 2014

Move to new alignment loader bigdatagenomics/avocado#125

Closed

[avocado-125] Move to new alignment loader. bigdatagenomics/avocado#126

Merged

fnothaft mentioned this pull request Jan 24, 2015

Simplify ADAMContext #553

Closed
