
bucketing strategy #1553

Closed · jpdna opened this issue May 30, 2017 · 1 comment

jpdna (Member) commented May 30, 2017

An issue that we discussed in the past, but I am not sure we ever prototyped:

I'd like to try writing Parquet files from the soon-to-be-ready ADAM dataset API, bucketed into 10-megabase genomic regions using df.write.bucketBy(), available in Spark 2.1, and then compare "random access" performance of lookups and joins using a "chr_bin" filter in a query (hopefully via partition discovery: https://spark.apache.org/docs/latest/sql-programming-guide.html#partition-discovery), as compared with normal Parquet predicate pushdown.
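For reference, a minimal sketch of what this could look like, assuming an ADAM-style DataFrame with contigName and start columns; the paths, bucket count, and table name below are illustrative. One caveat: in Spark 2.1, bucketBy() only works with saveAsTable() on a metastore table, while partition discovery applies to directory layouts produced by partitionBy().

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("chr-bin-bucketing").getOrCreate()

// Hypothetical input: an ADAM-style DataFrame with contigName and start columns.
val df = spark.read.parquet("hdfs:///data/reads.adam")

// Derive a coarse bin key: contig plus the 10 Mb window the record starts in,
// e.g. "chr1_12" for positions 120,000,000-129,999,999 on chr1.
val binned = df.withColumn(
  "chr_bin",
  concat_ws("_", col("contigName"), (col("start") / 10000000L).cast("long")))

// Option 1: directory partitioning; Spark's partition discovery can then
// prune whole directories when a query filters on chr_bin.
binned.write
  .partitionBy("chr_bin")
  .parquet("hdfs:///data/reads.binned.parquet")

// Option 2: bucketBy, which in Spark 2.1 requires saveAsTable rather than
// a plain parquet(path) write; the bucket count of 256 is arbitrary here.
binned.write
  .bucketBy(256, "chr_bin")
  .sortBy("start")
  .format("parquet")
  .saveAsTable("reads_bucketed")
```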

I'll go ahead and experiment with this, but wanted to get anyone else's thoughts on whether this seems viable or worthwhile. I'm hoping to achieve the effect of a very coarse index.
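A sketch of the comparison being proposed, continuing from the hypothetical columns and paths above: a range lookup against the binned layout, where a chr_bin filter lets Spark prune partitions, versus the same query against flat Parquet files, where only row-group predicate pushdown applies.

```scala
import org.apache.spark.sql.functions._

// Lookup in a single 10 Mb bin: with the partitionBy layout, Spark prunes
// to the one chr_bin=chr1_12 directory before touching any Parquet data.
val pruned = spark.read.parquet("hdfs:///data/reads.binned.parquet")
  .where(col("chr_bin") === "chr1_12" &&
         col("start").between(120500000L, 120600000L))

// Baseline: the same range query over unbinned files, relying solely on
// Parquet row-group min/max statistics (predicate pushdown) to skip data.
val baseline = spark.read.parquet("hdfs:///data/reads.adam")
  .where(col("contigName") === "chr1" &&
         col("start").between(120500000L, 120600000L))
```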

fnothaft (Member) commented

Hi @jpdna! I think this is a dupe of #651. I'm closing it as a dupe, but please reopen if you disagree.

heuermh modified the milestone: 0.23.0 on Jul 22, 2017