Support Hive-style partitioning #651
Don't these partitioners satisfy the requirements?
Possibly, although I think there's more work to write out data in a Hive-compatible directory structure. Or have you done that kind of thing already?
Ah, no, haven't done that!
OK, I'll take a look to see how to do that.
Here's an initial go that uses Kite to do partitioning: master...tomwhite:ADAM-651-hive-partitions. The idea is to add a partition command that takes a partition strategy file that defines which fields are used in constructing the partitions, and how the records map to partition values. For example, this partition strategy partitions by chromosome and bucketed position. This is not quite ready yet, as it depends on a snapshot version of Kite for https://issues.cloudera.org/browse/CDK-986 and https://issues.cloudera.org/browse/CDK-988.
Nice; looks good! I hadn't seen Kite before; I'll need to take more of a look.
As discussed, let's add command-line options that specify a few different baked-in partition schemes, such as locus-partitioned or sample-partitioned.
I updated my branch to use a partitioner to ensure that each reducer only writes to a small number of Parquet files. This makes sure that the reducers don't run out of memory. When I tried this on a cluster against the 1000 Genomes chr22 file (and a partition range size of 10^6), I got FetchFailedExceptions, as described here: https://issues.apache.org/jira/browse/SPARK-5928. I tried running with
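For concreteness, here is a minimal sketch of the kind of custom partitioner described above, assuming records are keyed by a (contig, position) pair; the class name, key shape, and 10^6 bin size are illustrative assumptions, not the code in the branch:

```scala
import org.apache.spark.Partitioner

// Minimal sketch (not the actual branch code): group records by genomic bin so
// that each reducer only writes to a small number of Parquet output files.
class GenomicBinPartitioner(parts: Int, binSize: Long = 1000000L) extends Partitioner {
  require(parts > 0, "number of partitions must be positive")

  override def numPartitions: Int = parts

  override def getPartition(key: Any): Int = key match {
    case (contig: String, position: Long) =>
      // Records within the same binSize window on the same contig land together.
      val hash = (contig, position / binSize).hashCode
      ((hash % parts) + parts) % parts
    case _ => 0
  }
}
```

One could then apply it with PairRDDFunctions.partitionBy on an RDD keyed by (contig, position) before writing, which is what bounds the number of files each reducer touches.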
The problem here turns out to be #599: when data is read from Parquet, it blows up in the shuffle to such an extent that shuffle blocks exceed 2 GB even for modest-sized input splits. For example, one 7.6 MB input file resulted in roughly 6.8 GB of shuffle data, a 1000x increase in size. Matt is working on a fix for #599, which will solve this problem.
@tomwhite was that on genotype data?
@fnothaft yes
Out of curiosity, do you know what your shuffle size/performance is if you use GZIP compression for shuffle (e.g., https://github.com/bigdatagenomics/utils/blob/master/utils-serialization/src/main/scala/org/bdgenomics/utils/serialization/compression/GzipCompressionCodec.scala)?
Good suggestion, I didn't realize there was a gzip codec for Spark. With gzip, the shuffle data is 478 MB, so 60x, but it's 3x slower. I need to try this out on a cluster to see if the fetch failure problem is avoided.
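For anyone following along, a hedged sketch of how such a codec could be wired in: spark.io.compression.codec accepts a fully qualified codec class name, and the class below is the one linked above (the surrounding configuration is illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch: swap Spark's block/shuffle compression codec for the bdg-utils gzip codec.
val conf = new SparkConf()
  .setAppName("adam-partition-test")
  .set("spark.io.compression.codec",
    "org.bdgenomics.utils.serialization.compression.GzipCompressionCodec")

val sc = new SparkContext(conf)
```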
I tried running on a cluster with the 1000 Genomes chr22 file, but it didn't finish after >2 hours. So I think we need #599.
Agreed; the gzip codec is just a stopgap until the solution to #599 is ready.
@tomwhite We have a workaround implemented for #599, the Kite issues mentioned above were resolved in version 1.1.0, and your branch looks generic enough that it shouldn't run into too many problems on a rebase. By eyeball I think the only necessary change would be
Might you have some time to update your branch and make a pull request?
@heuermh I would actually favour the Spark dataframes/datasets route to doing partitioning, since it's better supported than Kite. Also, flattening is no longer necessary since the major Hadoop SQL engines support nested types now (Impala didn't when I wrote my branch). BTW what's the workaround for #599?
@tomwhite Thank you for the reply. It is my understanding that we're not going to push too hard in the direction of Spark 2.0 dataframes/datasets until after we get our version 1.0 out. Should we keep this issue open to revisit at that time?
We write sequence dictionary, record group, and sample metadata out to disk in Avro format, separate from the Parquet files; see merged pull requests #906 leading to #1051.
Playing with this at the shell, I find that adding a bucketing directory hierarchy on Parquet write, where we write Parquet from Spark SQL, such as:
seems to produce the desired bucketed output by chr and then by 10 KB chunk, and interestingly still passes the existing tests.
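For reference, a hedged sketch of that kind of bucketed write through the Spark SQL API; the column names (contigName, start), the 10,000-position bin (the "10 KB chunk" above), and the paths are assumptions for illustration, not the snippet elided from the comment:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, floor}

val spark = SparkSession.builder().appName("hive-partitioned-write").getOrCreate()

// Read the existing (unpartitioned) genotype Parquet data.
val genotypes = spark.read.parquet("genotypes.adam")

// Derive a binned position column and write Hive-style partition directories,
// e.g. contigName=22/positionBin=1605/part-*.parquet.
genotypes
  .withColumn("positionBin", floor(col("start") / 10000L))
  .write
  .partitionBy("contigName", "positionBin")
  .parquet("genotypes.partitioned.adam")
```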
So, it looks like if we add a
Interesting! Let me think about it for a bit. An alternative would be to force a load through the SQL APIs and to drop the
First attempt can be seen here: fnothaft#17
Nice! I will take a deeper look later.
It's common to partition sequence data by locus. This change would make it possible to partition genotypes (alignments, etc.) using a Hive-compatible directory structure like chr=M/pos=N, where N is something like floor(position/10^K). Querying would then be more efficient, since the SQL engine would only need to read files in the partitions of interest (typically one partition when doing a point query).
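As a worked illustration of the layout (the chromosome, position, and K here are arbitrary):

```scala
// With K = 6, a record on chromosome 22 at position 16050075 falls into bucket
// floor(16050075 / 10^6) = 16, so it would live under a path like chr=22/pos=16/.
val k = 6
val position = 16050075L
val bucket = position / math.pow(10, k).toLong  // integer division gives 16
val partitionDir = s"chr=22/pos=$bucket"        // "chr=22/pos=16"
```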