C-MinHash, One Permutation Hashing with Circular Permutation (Li and Li 2022 ICML), the best MinHash in terms of accuracy, which has a significant improvement for RMSE than any other MinHash-like ones. Using a circular permutation for empty bins in the sketch vector generated from OPH, so that the empty bins can have hash values borrowed from other non-empty bins with some randomness, not just copy from the nearest non-empty bins as in the densified MinHash scheme. However, the circular permutation vector can be large for large D with a regular K (MinHash K) and it has to be circularly shifted K times and applied K times, which might consume a large amount of memory. For microbial genomes, with D often 10^7 or so and K 10^4 to 10^5 to have 99% ANI accuracy, the permutation vector is not a problem.
UltraLogLog(Ertl 2023) and ExaLogLog (Ertl 2024), a significant improvement over HyperLogLog for space/memory for cardinality estimation. We can use inclusion and exclusion rule to estimate Jaccard index after obtaining cardinality of two sets despite large variance. A simple but efficient new estimator of cardinality estimation, FGRA (Further Generalized Remaining Area) based on Tau-GRA (Wang and Pettie, 2023) can be used. But we can also map UltraLogLog to corresponding HyperLogLog to use maximum likelihood estimator (MLE) for cardinality (slower), or joint maximum likelihood estimator (JMLE) for intersecion cardinality (then Jaccard index), which are slightly more accurate than FGRA and meet the Cramér-Rao lower bound. The ExaLogLog estimator is also MLE based but it is not so fast despite smaller space requirement. However, the Hyperloglog-like scheme is not under the locality sensitive hashing scheme and the RMSE is much larger than MinHash-like ones.
For extremely large genome databases, e.g. 10^9 or more genomes, the size of sketches and HNSW graph size increase linearly, which creates a challenge for memory to store/load such big files. However, it is possible/easy to sketch and build HNSW graph only for a small piece of the large genome database and create several small sketch files and HNSW graph files for searching. We can search those several small pieces one by one, then collect search results(nearest neighbors) and sorting them accoridng to the Jaccard distance. This is algorithmically equal (for both accuracy and running time) to sketching and building HNSW graph for the entire large database but requires only a small piece of memory. But how to choose the small piece so that it is computationlly efficient? Coreset algorithm, if implemented in a streaming fashion, can be used to cluster similar genomes into clusters, in a way similar to that of k-medoid but significantly more computationally efficient, see coreset crate here. We will definitely implement the idea in the near future. For now, it easy to just mannually split the large database into small pieces, similar to the real-world industrial-level application of this idea for graph-based NNS library here\

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

todo.md

todo.md

Files

todo.md

Latest commit

History

todo.md

File metadata and controls