save/load problem: "found 0 roots with degree -1" #378

Closed
jlmelville opened this issue Apr 7, 2019 · 2 comments

@jlmelville
This seems related to #368 or #348, but I can't solve the problem by specifying the metric before loading. The problem is that after saving, I am unable to reload the index successfully: the load method reports "found 0 roots with degree -1".

Unfortunately, I have not been able to reproduce this with a small dataset. I am using the small NORB dataset, which is very high-dimensional (> 10,000 columns). I realize that this is a poor use case for Annoy, although Annoy is actually very accurate with the parameters I use (it's just very slow). Assuming you can get hold of the small NORB dataset (see below), the error is triggered in quite a straightforward fashion, but only after building an index on nearly 29,000 observations:

from annoy import AnnoyIndex

# norb_mat is the small NORB data matrix; see below for how to build it
ndim = 18432
ann = AnnoyIndex(ndim, metric="euclidean")
for i in range(28852):
    ann.add_item(i, norb_mat[i, :])

ann.verbose(True)
ann.build(50)

# this works fine
print(ann.get_nns_by_item(0, 15, search_k=1500))
ann.save("norb29k.test")

ann2 = AnnoyIndex(ndim, metric="euclidean")
# this doesn't find any data
ann2.load("norb29k.test")
print(ann2.get_nns_by_item(0, 15, search_k=1500))

To get hold of the small NORB dataset, the easiest way in Python is to use a loader that provides a SmallNORBDataset class, and then run:

import numpy as np

# SmallNORBDataset comes from the loader package mentioned above
norb = SmallNORBDataset(dataset_root="path/to/the/small/norb/directory")
norb_all = norb.data['train'] + norb.data['test']
# stack each observation's flattened left and right 96x96 camera images into one row
norb_mat = np.array([np.hstack((obs.image_lt.reshape((96 * 96,)),
                                obs.image_rt.reshape((96 * 96,))))
                     for obs in norb_all])

The problem manifests after adding item 28852. Before that, this code saves and loads without problem. There doesn't seem to be anything special about that vector; if I skip it and use 28853 instead, the problem still manifests.

The problem also manifests only with n_trees=50; below that value, it works fine. But it's not a straightforward memory issue either, because I can store and use the index for the entire small NORB dataset (48,600 items) with n_trees=50, as long as I don't attempt to save it to disk and reload it.

I am using Windows 10 with Python 3.7, but I originally saw the problem with RcppAnnoy, which binds to the C++ code directly, so it's not a Python problem. I apologize for not providing a more easily reproducible example.

@jlmelville (Author)

I have dug a bit into the C++ and found the problem: at least on my machine, Annoy cannot read files larger than 2GB, because sizeof(off_t) is 4 bytes. For small NORB, I was creating files of around 3GB. When loading, _n_nodes is calculated as:

    _n_nodes = (S)(size / _s);

but size has already overflowed its limits, I think.
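
If my reading of the Euclidean node layout is right (roughly a 16-byte header followed by one float per dimension, so _s ≈ 16 + 4 × 18432 = 73,744 bytes per node), the 28,852 item nodes alone come to about 2.1GB, only about 20MB short of the 2,147,483,647-byte ceiling of a signed 32-bit off_t, which would explain why the threshold sits where it does. Here is a minimal sketch of what the load arithmetic then does (not Annoy's actual code; the node size and file size are assumptions matching my case):

#include <cstdint>
#include <cstdio>

int main() {
    std::uint64_t file_size = 3000000000ULL; // the roughly 3GB index file
    std::int32_t s = 73744;                  // assumed node size _s for 18432 dims

    // with a 4-byte off_t, the size wraps to a negative value when read back...
    std::int32_t size = (std::int32_t)file_size;

    // ...so _n_nodes = (S)(size / _s) is negative garbage and no roots are found
    std::printf("size = %d, _n_nodes = %d\n", size, size / s);
    return 0;
}

With the usual two's-complement wraparound this prints size = -1294967296 and _n_nodes = -17560.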

Would it be possible to warn about this in save? Reversing the arithmetic of the load calculation, the size of the file should be _n_nodes * _s, and this should be a size_t. If its value would exceed the maximum positive value of off_t, then the index can't be read back in (at least by the machine that wrote it).
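
Something along these lines, perhaps (a sketch only, with made-up names rather than Annoy's actual internals):

#include <cstddef>
#include <cstdio>
#include <limits>
#include <sys/types.h> // off_t

// hypothetical helper: warn if an index of n_nodes nodes of s bytes each
// would produce a file too large for off_t to represent on this platform
static bool fits_in_off_t(std::size_t n_nodes, std::size_t s) {
    std::size_t total = n_nodes * s; // the file size, held in a size_t as above
    if (total > (std::size_t)std::numeric_limits<off_t>::max()) {
        std::fprintf(stderr,
                     "saving %zu bytes, but sizeof(off_t) is %zu; "
                     "this index cannot be loaded back on this machine\n",
                     total, sizeof(off_t));
        return false;
    }
    return true;
}

save could call something like this with _n_nodes and _s before writing, and emit the warning (or refuse to write) when it returns false.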

Apologies if this was obvious to everyone else reading this.

@jlmelville (Author)

Closing, because this was the same problem as #388 (which was fixed by #442).
