add bindash support instead of dashing #27

Open
jianshu93 opened this issue Mar 10, 2023 · 1 comment

Comments

@jianshu93

Hello Ben,

I have been investigating MinHash algorithms heavily over the past several months. For simple MinHash, that is, estimating the Jaccard index in the traditional manner, b-bit One Permutation MinHash with optimal densification (https://dl.acm.org/doi/abs/10.1145/1772690.1772759, https://proceedings.neurips.cc/paper/2012/file/eaa32c96f620053cf442ad32258076b9-Paper.pdf, http://proceedings.mlr.press/v70/shrivastava17a.html) is the most space- and time-efficient algorithm of all, including HyperLogLog. It is implemented in the bindash software (https://academic.oup.com/bioinformatics/article/35/4/671/5058094). Since Xiaofei left academia, bindash has not been developed further the way dashing has (dashing 2, for example). However, in several experiments, e.g. all-versus-all distance computation for all NCBI genomes, bindash was the fastest tool I have ever seen, about 2 times faster than dashing (I use k-mer size 16 and sketch size 12000 to get accuracy at the 95% ANI level). It supports only nucleotide sequences, not amino acid sequences as dashing and Mash do. I would also suggest not using finch, because it is memory-inefficient for large numbers of genomes. What do you think?
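To make the idea concrete, here is a minimal Python sketch of b-bit one-permutation MinHash with a simple densification step. It is only an illustration and not bindash's actual code: the function names (`oph_sketch`, `jaccard_estimate`), the use of BLAKE2 as the hash, and the defaults `num_bins=1024` and `b=8` are all my own assumptions.

```python
import hashlib


def _hash64(data: bytes) -> int:
    """64-bit hash of a byte string (BLAKE2b is an arbitrary choice here)."""
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def oph_sketch(items, num_bins=1024, b=8):
    """Illustrative b-bit one-permutation MinHash sketch with densification."""
    bins = [None] * num_bins
    for item in items:
        h = _hash64(str(item).encode())
        idx = h % num_bins        # one hash decides which bin the item falls in
        value = h // num_bins     # the remaining bits compete for the bin minimum
        if bins[idx] is None or value < bins[idx]:
            bins[idx] = value
    if all(v is None for v in bins):
        raise ValueError("cannot sketch an empty set")
    # Densification: every empty bin borrows from an originally non-empty bin.
    # The probe sequence depends only on (bin index, attempt), so two sketches
    # always probe the same bins in the same order.
    dense = list(bins)
    for i in range(num_bins):
        attempt = 0
        while dense[i] is None:
            j = _hash64(f"{i}:{attempt}".encode()) % num_bins
            dense[i] = bins[j]
            attempt += 1
    mask = (1 << b) - 1
    return [v & mask for v in dense]  # keep only the lowest b bits per bin


def jaccard_estimate(sketch_a, sketch_b, b=8):
    """Estimate Jaccard from two b-bit sketches, correcting for chance matches."""
    matches = sum(x == y for x, y in zip(sketch_a, sketch_b))
    p = matches / len(sketch_a)
    c = 1.0 / (1 << b)  # approximate chance that two unrelated b-bit values agree
    return max(0.0, (p - c) / (1.0 - c))


# Toy example with integer "k-mers": the true Jaccard index is 2000 / 6000.
a = oph_sketch(range(0, 4000))
b_sk = oph_sketch(range(2000, 6000))
print(jaccard_estimate(a, b_sk))
```

Each item is hashed only once regardless of sketch size, which is where the speed comes from, and storing b bits per bin instead of a full hash is where the space saving comes from.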

Thanks,

Jianshu

@jianshu93
Author

Hello Ben,

It has been a long time since my last message. I just want to make sure that the suggestions above make sense to you. Bindash is by far the fastest and also the most accurate MinHash-like algorithm, and it is better than HyperLogLog because of its smaller variance. Calling bindash instead of dashing should be very easy, because the output is the Mash distance; 1 - distance then gives the ANI used for clustering.
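For reference, a minimal sketch of the conversion I have in mind, assuming the distance follows the formula from the Mash paper, D = -ln(2J / (1 + J)) / k. The function names here are hypothetical, and whether bindash's reported distance matches this formula exactly should be checked against its output.

```python
import math


def mash_distance(jaccard: float, k: int) -> float:
    """Mash-style distance from a Jaccard estimate and k-mer size k."""
    if jaccard <= 0.0:
        return 1.0  # no shared k-mers: report the maximum distance
    d = -math.log(2.0 * jaccard / (1.0 + jaccard)) / k
    return max(0.0, min(1.0, d))


def ani_from_distance(distance: float) -> float:
    """Approximate ANI as 1 - distance, as proposed above for clustering."""
    return 1.0 - distance


# Example with k = 16 (the k-mer size mentioned earlier) and Jaccard = 0.5.
d = mash_distance(0.5, 16)
print(d, ani_from_distance(d))
```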

Thanks,

Jianshu
