Support Hive-style partitioning #651

Closed
tomwhite opened this issue Apr 15, 2015 · 24 comments
@tomwhite
Member

It's common to partition sequence data by locus. This change would make it possible to partition genotypes (alignments, etc.) using a Hive-compatible directory structure like chr=M/pos=N, where N is something like floor(position/10^K).

Querying would then be more efficient since the SQL engine would only need to read files in the partitions of interest (typically one partition when doing a point query).
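A rough sketch of the proposed layout in Scala; the Site record, field names, and bucket width (K = 6) here are illustrative assumptions, not ADAM's actual schema:

// Hypothetical record type used only to illustrate the chr=M/pos=N layout.
case class Site(chr: String, position: Long)

val bucketWidth = 1000000L  // assumes K = 6, i.e. 1 Mbp buckets

def partitionPath(s: Site): String = {
  val bucket = s.position / bucketWidth   // floor(position / 10^K) for non-negative positions
  s"chr=${s.chr}/pos=$bucket"             // Hive-style key=value directory names
}

// e.g. partitionPath(Site("M", 12345678L)) == "chr=M/pos=12"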

@fnothaft
Member

Don't these partitioners satisfy the requirements?

@tomwhite
Member Author

Possibly, although I think there's more work to write out data in a Hive-compatible directory structure. Or have you done that kind of thing already?

@fnothaft
Member

Ah, no, haven't done that!

@tomwhite
Member Author

OK, I'll take a look to see how to do that.

@tomwhite
Member Author

Here's an initial go that uses Kite to do partitioning:

master...tomwhite:ADAM-651-hive-partitions

The idea is to add a partition command that takes a partition strategy file that defines which fields are used in constructing the partitions, and how the records map to partition values. For example, this partition strategy partitions by chromosome and bucketed position.

This is not quite ready yet, as it depends on a snapshot version of Kite for https://issues.cloudera.org/browse/CDK-986 and https://issues.cloudera.org/browse/CDK-988.

@fnothaft
Member

Nice; looks good! I hadn't seen Kite before; I'll need to take more of a look.

@laserson
Contributor

As discussed, let's add command-line options that specify a few different baked-in partition schemes, such as locus-partitioned or sample-partitioned.

@tomwhite
Member Author

I updated my branch to use a partitioner to ensure that each reducer only writes to a small number of Parquet files. This makes sure that the reducers don't run out of memory.

When I tried this on a cluster against the 1000 genomes chr22 file (and a partition range size of 10^6) I got FetchFailedExceptions, as described here: https://issues.apache.org/jira/browse/SPARK-5928. I tried running with ADAM_OPTS='--conf spark.io.compression.codec=lz4' as this was reported to have helped, but the job still failed in the same way. Does anyone have any suggestions for how to get around this problem?
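For context, a minimal sketch of the kind of partitioner described above, keyed on a (contig, position bucket) pair; the key shape and bucket count are assumptions for illustration, not the code in the branch:

import org.apache.spark.Partitioner

// Sketch: co-locate all records for one (contig, bucket) pair on one reducer,
// so each reducer writes to a small, bounded set of Parquet files.
class LocusBucketPartitioner(contigs: Seq[String], bucketsPerContig: Int)
    extends Partitioner {

  private val contigIndex = contigs.zipWithIndex.toMap

  override def numPartitions: Int = contigs.size * bucketsPerContig

  override def getPartition(key: Any): Int = key match {
    case (contig: String, bucket: Long) =>
      contigIndex(contig) * bucketsPerContig + (bucket % bucketsPerContig).toInt
    case _ => 0
  }
}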

@tomwhite
Member Author

The problem here turns out to be #599: when data is read from Parquet, it blows up in the shuffle to such an extent that shuffle blocks exceed 2GB even for modest-sized input splits. For example, one 7.6MB input file resulted in roughly 6.8GB of shuffle data, a nearly 1000x increase in size.

Matt is working on a fix for #599, which will solve this problem.

@fnothaft
Member

@tomwhite was that on genotype data?

@tomwhite
Member Author

@fnothaft yes

@fnothaft
Member

Out of curiosity, do you know what your shuffle size/performance is if you use GZIP compression for the shuffle (e.g., https://github.com/bigdatagenomics/utils/blob/master/utils-serialization/src/main/scala/org/bdgenomics/utils/serialization/compression/GzipCompressionCodec.scala)?
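For anyone trying this, Spark accepts a fully qualified class name for spark.io.compression.codec, so wiring in that codec should look roughly like the following untested sketch:

import org.apache.spark.SparkConf

// Point Spark's shuffle/block compression at the bdg-utils gzip codec.
val conf = new SparkConf()
  .set("spark.io.compression.codec",
       "org.bdgenomics.utils.serialization.compression.GzipCompressionCodec")

// or equivalently, via the ADAM launcher:
// ADAM_OPTS='--conf spark.io.compression.codec=org.bdgenomics.utils.serialization.compression.GzipCompressionCodec'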

@tomwhite
Member Author

Good suggestion; I didn't realize there was a gzip codec for Spark. With gzip, the shuffle data is 478MB, a roughly 60x increase rather than 1000x, but it's 3x slower. I need to try this out on a cluster to see if the fetch failure problem is avoided.

@tomwhite
Member Author

I tried running on a cluster with the 1000 genomes chr22 file, but it didn't finish after >2 hours. So I think we need #599.

@fnothaft
Member

Agreed; the gzip codec is just a stopgap until the solution to #599 is ready.

@heuermh
Member

heuermh commented Jul 21, 2016

@tomwhite We have a workaround implemented for #599, the Kite issues mentioned above were resolved in version 1.1.0, and your branch looks generic enough that it shouldn't run into too many problems on a rebase.

By eyeball I think the only necessary change would be
variant.contig.contigName → variant.contigName

Might you have some time to update your branch and make a pull request?

@tomwhite
Member Author

@heuermh I would actually favour the Spark dataframes/datasets route to doing partitioning, since it's better supported than Kite. Also, flattening is no longer necessary since the major Hadoop SQL engines support nested types now (Impala didn't when I wrote my branch).

BTW what's the workaround for #599?

@heuermh
Member

heuermh commented Jul 22, 2016

@tomwhite Thank you for the reply. It is my understanding that we're not going to push too hard in the direction of Spark 2.0 dataframes/datasets until after we get our version 1.0 out. Should we keep this issue open to revisit at that time?

what's the workaround for #599?

We write sequence dictionary, record group, and sample metadata out to disk in Avro format, separate from the Parquet files; see merged pull request #906, leading to #1051.
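For readers following along, the general shape of that workaround, writing small Avro sidecar files next to the Parquet data, looks something like this sketch (the record types, schema handling, and file names are placeholders; see #906 and #1051 for the real code):

import java.io.File
import org.apache.avro.Schema
import org.apache.avro.file.DataFileWriter
import org.apache.avro.specific.{SpecificDatumWriter, SpecificRecord}

// Write a list of Avro records to a sidecar file next to the Parquet output.
def writeMetadata[T <: SpecificRecord](records: Seq[T], schema: Schema, out: File): Unit = {
  val writer = new DataFileWriter[T](new SpecificDatumWriter[T](schema))
  writer.create(schema, out)       // e.g. <output>/_metadata.avro (illustrative name)
  records.foreach(writer.append)
  writer.close()
}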

@fnothaft
Member

fnothaft commented Mar 3, 2017

This should be easy to do on top of #1216 or #1391.

@fnothaft fnothaft modified the milestones: 0.24.0, 0.23.0 Mar 3, 2017
@fnothaft fnothaft modified the milestones: 0.24.0, 0.23.0 May 12, 2017
@jpdna
Member

jpdna commented Jul 10, 2017

Playing with this at the shell, I find that adding a bucketed directory hierarchy to the Parquet write, where we write Parquet from Spark SQL, e.g. at:
https://github.com/fnothaft/adam/blob/issues/1018-dataset-api/adam-core/src/main/scala/org/bdgenomics/adam/rdd/read/AlignmentRecordRDD.scala#L193

such as:

import org.apache.spark.sql.functions.floor

val df = rdd.dataset.toDF()
df.withColumn("posBin", floor(df.col("start") / 10000))
  .write.partitionBy("contigName", "posBin")
  .format("parquet")
  .option("spark.sql.parquet.compression.codec", "gzip")
  .save("jptest33")

seems to produce the desired bucketed output by chromosome and then by 10 kbp chunk, and interestingly still passes the existing tests.
I'll experiment more and see whether we get a performance improvement when querying a small region by position. Note: Spark 2.1+ required.
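If that layout holds up, a point query on the partitioned output should only touch one contigName=.../posBin=... directory when read back; roughly (paths and values below are just for illustration):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Partition pruning should restrict the scan to contigName=22/posBin=1610/
// instead of reading every file under jptest33.
val hits = spark.read.parquet("jptest33")
  .where("contigName = '22' AND posBin = 1610 AND start >= 16100000 AND start < 16110000")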

@jpdna
Member

jpdna commented Jul 10, 2017

So it looks like if we add a posBin column to the dataframe, binning into 10,000 bp chunks with floor, we are going to need to add a posBin field to the Avro bdg-formats schemas, e.g. to AlignmentRecord. This could be a price worth paying if there is a performance improvement. Thoughts?

@fnothaft
Member

fnothaft commented Jul 10, 2017

Interesting! Let me think about it for a bit. An alternative would be to force a load through the SQL APIs and to drop the posBin column before converting to RDD form.
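Roughly, that alternative would look like the following sketch (column and path names are illustrative):

// Force the load through the SQL API, then discard the synthetic partition
// column before handing the data back in RDD form.
val df = spark.read.parquet("jptest33")
val cleaned = df.drop("posBin")   // no new field needed in bdg-formats
val asRdd = cleaned.rdd           // converting rows back to AlignmentRecord objects
                                  // would go through ADAM's usual converters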

@jpdna
Member

jpdna commented Jul 10, 2017

First attempt can be seen here: fnothaft#17

@fnothaft
Member

Nice! I will take a deeper look later.
