[ADAM-651] Hive-style partitioning of parquet files by genomic position #1911
Conversation
Commits:
- removed addChrPrefix parameter
- Address PR comments - part 1
- Address PR comments - part 2
- fix nits
- Rebased
- Address review comments - part 3
- address reviewer comments - white space and redundant asserts
- fixed isPartitioned
- fixed codacy issue with return in isPartitioned
- made writePartitionedParquetFlag private
- updated overlap calc in refereceRegionsToDatasetQueryStirng try2
- add filter in genomicRDD
- move filterOverlapping to alignment record
- another try
- trying as filterDatasetByOverlappingRegions
- move filter mapq out of AlignmentRecord
- clean up, ready to retest with Mango feb 12th morning
- added GenotypeRDD filterByOverlappingRegion
- support dataset region filter in all types
- updated ADAMContext to use filterByOverlap and added docs
- fix doc indent by mvn
- removed public referenceRegionQueryStrin function form ADAMContext
Please review @heuermh, @akmorrow13 and others, and let me know changes needed. TODO: I have not yet resolved the documented ambiguous behavior when only some directories in a glob are partitioned. I suggest we keep the documented ambiguous behavior if a merge is possible for v24 this week, and update later.
Test PASSed.
@@ -26,7 +26,7 @@ import java.nio.file.Paths
 import org.apache.hadoop.fs.Path
 import org.apache.hadoop.io.LongWritable
 import org.apache.parquet.hadoop.metadata.CompressionCodecName
-import org.apache.spark.SparkContext
+import org.apache.spark.{ SparkContext }
remove brackets
Test PASSed.
Test PASSed.
Test PASSed.
My recent commits here are an effort to better deal with the partitionBinSize parameter. Prior to the previous commit, a user would need to keep track of what partition size they used when they wrote a dataset, if they used something other than the default of 1 Mb. The tracking issue is solved now by writing the partition size as an integer to the flag file. The next goal I have is for the user at read time to not need to retrieve or supply the partitionBinSize when working with the dataset, because this is already set for the dataset at write time, and the dataset should know its own partitionBinSize and use it. The best way I see to deal with this is to have the dataset itself carry that value. My concern, though, is that this is a specific implementation detail for partitioned Parquet-backed datasets, and we may not want it adding noise to the general API. Let me know if you see a better way. Pinging @fnothaft and all for this and general review.
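For concreteness, here is a minimal sketch of the flag-file mechanism described above. I'm assuming a flag file named `_partitionedByStartPos` at the dataset root holding the bin size as a raw int; the actual file name and encoding in this PR may differ.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Record the bin size at write time so readers never have to supply it again.
def writePartitionedParquetFlag(pathName: String, partitionSize: Int): Unit = {
  val path = new Path(pathName, "_partitionedByStartPos")
  val fs = path.getFileSystem(new Configuration())
  val os = fs.create(path)
  os.writeInt(partitionSize)
  os.close()
}

// Recover the bin size at read time from the same flag file.
def getPartitionedBinSize(pathName: String): Int = {
  val path = new Path(pathName, "_partitionedByStartPos")
  val fs = path.getFileSystem(new Configuration())
  val is = fs.open(path)
  try is.readInt() finally is.close()
}
```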
Test FAILed. Build result: FAILURE. Jenkins checked out revision 026b51f (origin/pr/1911/merge) and triggered the ADAM-prb matrix; all four configurations (Hadoop 2.6.2 and 2.7.3 × Scala 2.10 and 2.11, on Spark 2.2.1, centos) completed with result FAILURE.
5c29dde passes all tests for me locally, any ideas why it fails above?
Mostly stylistic nits, otherwise generally LGTM! Thanks @jpdna!
@@ -131,6 +131,8 @@ class TransformAlignmentsArgs extends Args4jBase with ADAMSaveAnyArgs with Parqu
 var storageLevel: String = "MEMORY_ONLY"
 @Args4jOption(required = false, name = "-disable_pg", usage = "Disable writing a new @PG line.")
 var disableProcessingStep = false
+@Args4jOption(required = false, name = "-save_as_dataset", usage = "EXPERIMENTAL: Save as a Parquet format Spark-SQL dataset")
The doc on this is a bit misleading. The partitioning is enabled by Spark SQL, but is not really Spark SQL specific. I'd say that providing this flag saves the data partitioned by genomic locus using Hive-style partitioning.
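To make "Hive-style partitioning by genomic locus" concrete, here is a minimal sketch in plain Spark SQL. The column names `contigName` and `positionBin` come from the query strings elsewhere in this PR, but the exact write path here is an illustration, not the PR's implementation:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{ col, floor }

// Bin each record by start position, then let Spark lay the files out as
// Hive-style directories: <path>/contigName=chr1/positionBin=12/part-*.parquet
def saveAsPartitionedParquet(df: DataFrame, pathName: String, binSize: Int = 1000000): Unit = {
  df.withColumn("positionBin", floor(col("start") / binSize))
    .write
    .partitionBy("contigName", "positionBin")
    .parquet(pathName)
}
```

Because the partition values are encoded in directory names, a later query that constrains `contigName` and `positionBin` can skip whole directories without opening any Parquet files.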
Also, "Spark SQL" vs "Spark-SQL".
Resolved, and changed the parameter to an integer so the user can specify the bin size.
outputRdd.save(args,
  isSorted = args.sortReads || args.sortLexicographically)
if (args.saveAsDataset) {
  outputRdd.saveAsPartitionedParquet(args.outputPath)
A bit irrelevant to this line, but OOC, how does saveAsPartitionedParquet handle data that isn't aligned? It seems like we may want to warn if there isn't a sequence dictionary attached.
It will save all data to partition bin number 0, and doing so will not result in any benefit. A warning stating that has been added with log.warn.
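A sketch of that behavior, assuming slf4j logging and an optional start position for possibly-unaligned records (names illustrative, not the PR's exact code):

```scala
import org.slf4j.LoggerFactory

object PartitionBinning {
  private val log = LoggerFactory.getLogger(getClass)

  // Aligned records bin by start position; unaligned records carry no
  // position, so they all collapse into bin 0 and gain nothing from pruning.
  def positionBin(start: Option[Long], binSize: Int): Long = {
    start match {
      case Some(s) => s / binSize
      case None =>
        log.warn("Saving unaligned data as partitioned Parquet gives no query benefit; writing to bin 0.")
        0L
    }
  }
}
```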
val reads = loadParquetAlignments(pathName)

val datasetBoundAlignmentRecordRDD = if (regions.nonEmpty) {
  DatasetBoundAlignmentRecordRDD(reads.dataset, reads.sequences, reads.recordGroups, reads.processingSteps, Some(partitionedBinSize))
Nit: Long line, should be broken up.
  DatasetBoundAlignmentRecordRDD(reads.dataset, reads.sequences, reads.recordGroups, reads.processingSteps, Some(partitionedBinSize))
    .filterByOverlappingRegions(regions)
} else {
  DatasetBoundAlignmentRecordRDD(reads.dataset, reads.sequences, reads.recordGroups, reads.processingSteps)
Nit: Long line, should be broken up.
val genotypes = loadParquetGenotypes(pathName)

val datasetBoundGenotypeRDD = if (regions.nonEmpty) {
  DatasetBoundGenotypeRDD(genotypes.dataset, genotypes.sequences, genotypes.samples, genotypes.headerLines, Some(partitionedBinSize))
Nit: Long line, should be broken up.
* If a glob is used, all directories within the glob must be partitioned, and must have been saved
* using the same partitioned bin size. Behavior is undefined if this requirement is not met.
*/
def getPartitionedBinSize(pathName: String): Int = {
This function should be private, or at least private[rdd].
Done, but had to add a separate public isPartitioned function, because Mango needs to be able to check whether data is partitioned.
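The public check can stay tiny; a sketch, again assuming the `_partitionedByStartPos` flag file:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// A dataset is partitioned iff the flag file written at save time exists.
def isPartitioned(pathName: String): Boolean = {
  val path = new Path(pathName, "_partitionedByStartPos")
  path.getFileSystem(new Configuration()).exists(path)
}
```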
"\' and (end > " + r.start + " and start < " + r.end + "))") | ||
.mkString(" or ") | ||
} | ||
|
Nit: extra whitespace.
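For readers following along, a sketch of how the full predicate above could be assembled. The tail of each clause matches the diff; the positionBin bounds and look-back arithmetic are my assumption about how the bin pruning works (looking back widens the scan so records that start in an earlier bin but overlap the query region are not missed):

```scala
import org.bdgenomics.adam.models.ReferenceRegion

def referenceRegionsToDatasetQueryString(
  regions: Iterable[ReferenceRegion],
  partitionSize: Int = 1000000,
  partitionedLookBackNum: Int = 1): String = {

  regions.map(r => {
    // Look back one or more bins so overlapping records whose start falls
    // in an earlier bin are still scanned.
    val startBin = math.max(0L, r.start / partitionSize - partitionedLookBackNum)
    val endBin = r.end / partitionSize
    "(contigName = \'" + r.referenceName + "\' and positionBin >= " + startBin +
      " and positionBin <= " + endBin +
      " and (end > " + r.start + " and start < " + r.end + "))"
  }).mkString(" or ")
}
```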
* ReferenceRegion, defaults to 1
* @return Returns a new DatasetBoundGenomicRDD with ReferenceRegions filter applied.
*/
def filterDatasetByOverlappingRegions(querys: Iterable[ReferenceRegion],
I'd looove this to move into the new DatasetBoundGenomicDataset trait which came in @akmorrow13's 6051321, and then this would override filterByOverlappingRegions. However, this has a slightly different signature due to the addition of optPartitionedLookBackNum. I propose that we either:

- Move this into DatasetBoundGenomicDataset and make this method protected, and then override filterByOverlappingRegions(querys) to call filterDatasetByOverlappingRegions(querys).
- Or, keep this as is now but open a ticket to close up this API in 0.25.0.

Thoughts?

Unrelated nit: querys should be queries.
Alternatively - upon thinking about it, I don't like this optPartitionedLookBackNum as a parameter to the filter-by-overlap function anyhow, as it is a config parameter tied to the dataset, just like partitionedBinSize, that should only need to be set once when the dataset is created. We could instead add it as another optional parameter to the DatasetBound type constructors. If we did that, the optPartitionedLookBackNum parameter could go away, making the filterByOverlappingRegions override in DatasetBoundGenomicDataset work cleanly.
SGTM!
@@ -2531,6 +2546,23 @@ abstract class AvroGenomicRDD[T <% IndexedRecord: Manifest, U <: Product, V <: A
  saveSequences(filePath)
}

protected def referenceRegionsToDatasetQueryString(regions: Iterable[ReferenceRegion], partitionSize: Int = 1000000, partitionedLookBackNum: Int = 1): String = {
I see that a lot of downstream classes that implement GenomicDataset wind up overriding filterDatasetByOverlappingRegions with transformDataset(ds => ds.filter(referenceRegionsToDatasetQueryString(...))). I get why they have to do this, and I think fixing it is out of scope for this PR. However, can you open a ticket for us to move that code into a trait in a future release?
In the update, filterDatasetByOverlappingRegions is implemented generically in the trait DatasetBoundGenomicDataset, thus there is no more overriding.
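Roughly, the resolved shape looks like this. This is a sketch, not the verbatim trait; I'm assuming DatasetBoundGenomicDataset can see the dataset, a transformDataset, the constructor-supplied partitionedBinSize, and the query-string helper:

```scala
import org.apache.spark.sql.Dataset
import org.bdgenomics.adam.models.ReferenceRegion

trait DatasetBoundGenomicDataset[U <: Product, V] {

  val dataset: Dataset[U]
  val partitionedBinSize: Option[Int]

  def transformDataset(tFn: Dataset[U] => Dataset[U]): V

  protected def referenceRegionsToDatasetQueryString(
    regions: Iterable[ReferenceRegion],
    partitionSize: Int): String

  // Implemented once here, so concrete types no longer override it.
  def filterDatasetByOverlappingRegions(regions: Iterable[ReferenceRegion]): V = {
    transformDataset(d => d.filter(
      referenceRegionsToDatasetQueryString(regions, partitionedBinSize.getOrElse(1000000))))
  }
}
```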
override def filterDatasetByOverlappingRegions(querys: Iterable[ReferenceRegion],
  optPartitionedLookBackNum: Option[Int] = Some(1)): FragmentRDD = {
  transformDataset((d: Dataset[org.bdgenomics.adam.sql.Fragment]) =>
    d.filter(referenceRegionsToDatasetQueryString(querys, partitionedBinSize.get, optPartitionedLookBackNum.get)))
This won't work for Fragment, no? Fragment doesn't have contigName, start, or end fields.
You are right, removed Fragment.
Replaced by #1922
Fixes #651.
Uses genomic range binning to write partitioned Parquet files, readable by the Spark Dataset API.
Significantly improves latency when filtering by genomic regions at read time.
Replaces #1878
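As a hypothetical read-side illustration of the latency claim: since the partition columns live in directory names, Spark prunes non-matching directories before touching any Parquet data (the path and bin numbers below are made up):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("partition-pruning-demo").getOrCreate()

// Only the contigName=chr1/positionBin={11,12} directories are scanned,
// rather than every file in the dataset.
val reads = spark.read.parquet("sample.alignments.adam")
  .where("contigName = 'chr1' and positionBin >= 11 and positionBin <= 12 " +
    "and end > 12000000 and start < 12500000")
reads.show()
```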