Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

n_unique_kmers() returns 0 upon save/reload #1781

Open
dkoslicki opened this issue Sep 12, 2017 · 5 comments
Open

n_unique_kmers() returns 0 upon save/reload #1781

dkoslicki opened this issue Sep 12, 2017 · 5 comments

Comments

@dkoslicki
Copy link

I noticed that using the Nodegraph method n_unique_kmers() returns 0 whenever a Nodegraph is saved and then reloaded. Minimum working example:

import khmer
import tempfile
import os
temp_file = tempfile.mktemp()
ng = khmer.Nodegraph(5, 1e5, 4)
ng.add("AAAAA")
print("Number k-mers: %d" % ng.n_unique_kmers())  # Returns correct answer of 1
ng.save(temp_file)
# Reload and try again
ng2 = khmer.load_nodegraph(temp_file)
print("Number k-mers after load: %d" % ng2.n_unique_kmers())  # Returns incorrect answer of 0
os.remove(temp_file)  # clean up

Same thing happens on realistic data.

Oddly enough, the method n_occupied() doesn't seem to have this issue:

print("Number occupied: %d" % ng.n_occupied())
print("Number occupied after load: %d" % ng2.n_occupied())

As an aside: it isn't clear to me (from the documentation) that one should use n_unique_kmers() or n_occupied() if one wants to estimate the number of (distinct) k-mers that are present in a Nodegraph().

@betatim
Copy link
Member

betatim commented Sep 12, 2017

Thanks for finding this. Which version of khmer did you use? I managed to reproduce this on master with the slight modification of:

ng3 = khmer.Nodegraph.load(temp_file)

@betatim
Copy link
Member

betatim commented Sep 12, 2017

n_unique_kmers isn't stored in the saved file, which is why it comes out as zero after loading a file. Does anyone remember why this was done or simply an oversight?

I think you want to use n_unique_kmers for estimating distinc k-mers. n_occupied keeps track of how many bits are turned on in the first table used in the Nodegraph, not all of them.

@dkoslicki
Copy link
Author

It looks like I'm using 2.1.1 (which is what pip install khmer gives).
Good to know about n_unique_kmers versus n_occupied; the source of my confusion was the fact that (when creating a Nodegraph, not loading one), n_unique_kmers and n_occupied were always returning the same value on real data (though I have subsequently found an artificial example where they differ).

@ctb
Copy link
Member

ctb commented Oct 11, 2017

This is an oversight, @betatim!

@dkoslicki, the HyperLogLog stuff (in unique-kmers.py and elsewhere) is probably the best way to go for estimating k-mer counts, if that's feasible. @luizirber may have some other thoughts as I know he is working on similar stuff for sourmash.

Also, #1248 needs to be linked in...

@dkoslicki
Copy link
Author

Thanks @ctb; I am already forming the Nodegraph's, so thought n_unique_kmers would be a handy way to circumvent repeat calculations.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants