
bucketing strategy #1553

Closed · jpdna opened this issue May 30, 2017 · 1 comment

jpdna (Member) commented May 30, 2017

An issue that we discussed in the past, but I am not sure we ever prototyped:

I'd like to try writing Parquet files from the soon-to-be-ready ADAM dataset API, bucketed into 10-megabase genomic regions using df.write.bucketBy(), available in Spark 2.1, and then compare "random access" performance of lookups and joins using a "chr_bin" filter in a query (hopefully via partition discovery: https://spark.apache.org/docs/latest/sql-programming-guide.html#partition-discovery), as compared with normal Parquet predicate pushdown.
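For reference, a minimal sketch of what this could look like, assuming an ADAM-style DataFrame with contigName and start columns; the paths, bucket count, and table name below are illustrative. One caveat: in Spark 2.1, bucketBy() only works with saveAsTable() on a metastore table, while partition discovery applies to directory layouts produced by partitionBy().

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("chr-bin-bucketing").getOrCreate()

// Hypothetical input: an ADAM-style DataFrame with contigName and start columns.
val df = spark.read.parquet("hdfs:///data/reads.adam")

// Derive a coarse bin key: contig plus the 10 Mb window the record starts in,
// e.g. "chr1_12" for positions 120,000,000-129,999,999 on chr1.
val binned = df.withColumn(
  "chr_bin",
  concat_ws("_", col("contigName"), (col("start") / 10000000L).cast("long")))

// Option 1: directory partitioning; Spark's partition discovery can then
// prune whole directories when a query filters on chr_bin.
binned.write
  .partitionBy("chr_bin")
  .parquet("hdfs:///data/reads.binned.parquet")

// Option 2: bucketBy, which in Spark 2.1 requires saveAsTable rather than
// a plain parquet(path) write; the bucket count of 256 is arbitrary here.
binned.write
  .bucketBy(256, "chr_bin")
  .sortBy("start")
  .format("parquet")
  .saveAsTable("reads_bucketed")
```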

I'll go ahead and experiment with this, but wanted to get anyone else's thoughts on whether this seems viable or worthwhile. I'm hoping to achieve the effect of a very coarse index.
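A sketch of the comparison being proposed, continuing from the hypothetical columns and paths above: a range lookup against the binned layout, where a chr_bin filter lets Spark prune partitions, versus the same query against flat Parquet files, where only row-group predicate pushdown applies.

```scala
import org.apache.spark.sql.functions._

// Lookup in a single 10 Mb bin: with the partitionBy layout, Spark prunes
// to the one chr_bin=chr1_12 directory before touching any Parquet data.
val pruned = spark.read.parquet("hdfs:///data/reads.binned.parquet")
  .where(col("chr_bin") === "chr1_12" &&
         col("start").between(120500000L, 120600000L))

// Baseline: the same range query over unbinned files, relying solely on
// Parquet row-group min/max statistics (predicate pushdown) to skip data.
val baseline = spark.read.parquet("hdfs:///data/reads.adam")
  .where(col("contigName") === "chr1" &&
         col("start").between(120500000L, 120600000L))
```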

fnothaft (Member) commented

Hi @jpdna! I think this is a dupe of #651. I'm closing it as a dupe, but please reopen if you disagree.

heuermh modified the milestone: 0.23.0 on Jul 22, 2017