An issue that we discussed in the past, but I am not sure we ever prototyped:

I'd like to try writing Parquet files from the soon-to-be-ready ADAM dataset API, bucketed by 10-megabase genomic regions using `df.write.bucketBy()` (available in Spark 2.1), and then compare "random access" performance of lookups and joins using a "chr_bin" filter in a query (hopefully via partition discovery: https://spark.apache.org/docs/latest/sql-programming-guide.html#partition-discovery), as compared with normal Parquet predicate pushdown.

I'll go ahead and experiment with this, but wanted to get anyone else's thoughts on whether this seems viable or worthwhile. I'm hoping to achieve the effect of a very coarse index.
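Here's a rough sketch of what I have in mind, to make the idea concrete. Column names (`contigName`, `start`), paths, and the derived `chr_bin` key are illustrative rather than the actual ADAM schema, and note that in Spark 2.1 `bucketBy` only works with `saveAsTable`, so the directory-partitioning route via `partitionBy` is what the partition-discovery link above would exercise:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object ChrBinPartitioningSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("chr-bin-partitioned-parquet")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical input: a DataFrame of reads with a contig name and start position.
    val reads = spark.read.parquet("reads.adam")

    // Derive a coarse 10-megabase bin key, e.g. "chr1_3" for starts in [30,000,000, 40,000,000).
    val binned = reads.withColumn(
      "chr_bin",
      concat_ws("_", $"contigName", ($"start" / 10000000L).cast("long").cast("string")))

    // Option 1: directory partitioning, so partition discovery can prune on chr_bin.
    binned.write
      .partitionBy("chr_bin")
      .parquet("reads.chr_bin_partitioned.parquet")

    // Reading back: a filter on chr_bin should only touch the matching directories,
    // which is the "random access" behavior to benchmark against predicate pushdown.
    val region = spark.read
      .parquet("reads.chr_bin_partitioned.parquet")
      .where($"chr_bin" === "chr1_3")
    println(region.count())

    // Option 2: bucketBy in Spark 2.1 requires a managed table rather than a bare path.
    // binned.write.bucketBy(256, "chr_bin").sortBy("start").saveAsTable("reads_bucketed")
  }
}
```

One thing to watch with `partitionBy` is directory explosion: a whole genome at 10 Mb bins yields a few hundred partitions, which seems manageable, but finer bins would multiply small files.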