save/load problem: "found 0 roots with degree -1" #378

Closed
jlmelville opened this issue Apr 7, 2019 · 2 comments

@jlmelville
This seems related to #368 or #348, but I can't solve the problem by specifying the metric before loading. The problem is that after saving, I am unable to reload the index successfully: the load method reports "found 0 roots with degree -1".

Unfortunately, I have not been able to reproduce this with a small dataset. I am using the small NORB dataset, which is very high-dimensional (> 10,000 columns). I realize that this is a poor use case for Annoy, although Annoy is actually very accurate with the parameters I use (it's just very slow). Assuming you can get hold of the small NORB dataset (see below), the error is triggered in quite a straightforward fashion, but only after building an index on nearly 29,000 observations:

from annoy import AnnoyIndex

# norb_mat is the small NORB data matrix; see below for how to build it
ndim = 18432
ann = AnnoyIndex(ndim, metric="euclidean")
for i in range(28852):
    ann.add_item(i, norb_mat[i, :])

ann.verbose(True)
ann.build(50)

# this works fine
print(ann.get_nns_by_item(0, 15, search_k=1500))
ann.save("norb29k.test")

ann2 = AnnoyIndex(ndim, metric="euclidean")
# this doesn't find any data
ann2.load("norb29k.test")
print(ann2.get_nns_by_item(0, 15, search_k=1500))

To get hold of the small NORB dataset, the easiest way in Python is to use a loader that provides a SmallNORBDataset class, and then run:

import numpy as np

# SmallNORBDataset comes from the loader package mentioned above
norb = SmallNORBDataset(dataset_root="path/to/the/small/norb/directory")
norb_all = norb.data['train'] + norb.data['test']
# stack each observation's flattened left and right 96x96 camera images into one row
norb_mat = np.array([np.hstack((obs.image_lt.reshape((96 * 96,)),
                                obs.image_rt.reshape((96 * 96,))))
                     for obs in norb_all])

The problem manifests after adding item 28852. Before that, this code saves and loads without problem. There doesn't seem to be anything special about that vector; if I skip it and use 28853 instead, the problem still manifests.

The problem also manifests only with n_trees=50; below that value, it works fine. But it's not a straightforward memory issue either, because I can store and use the index for the entire small NORB dataset (48,600 items) with n_trees=50, as long as I don't attempt to save it to disk and reload it.

I am using Windows 10 with Python 3.7, but I originally saw the problem with RcppAnnoy, which binds to the C++ code directly, so it's not a Python problem. I apologize for not providing a more easily reproducible example.

@jlmelville (Author)

I have dug a bit into the C++ and found the problem: at least on my machine, Annoy cannot read files larger than 2GB, because sizeof(off_t) is 4 bytes. For small NORB, I was creating files of around 3GB. When loading, _n_nodes is calculated as:

    _n_nodes = (S)(size / _s);

but size has already overflowed its limits, I think.
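
If my reading of the Euclidean node layout is right (roughly a 16-byte header followed by one float per dimension, so _s ≈ 16 + 4 × 18432 = 73,744 bytes per node), the 28,852 item nodes alone come to about 2.1GB, only about 20MB short of the 2,147,483,647-byte ceiling of a signed 32-bit off_t, which would explain why the threshold sits where it does. Here is a minimal sketch of what the load arithmetic then does (not Annoy's actual code; the node size and file size are assumptions matching my case):

#include <cstdint>
#include <cstdio>

int main() {
    std::uint64_t file_size = 3000000000ULL; // the roughly 3GB index file
    std::int32_t s = 73744;                  // assumed node size _s for 18432 dims

    // with a 4-byte off_t, the size wraps to a negative value when read back...
    std::int32_t size = (std::int32_t)file_size;

    // ...so _n_nodes = (S)(size / _s) is negative garbage and no roots are found
    std::printf("size = %d, _n_nodes = %d\n", size, size / s);
    return 0;
}

With the usual two's-complement wraparound this prints size = -1294967296 and _n_nodes = -17560.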

Would it be possible to warn about this in save? Reversing the arithmetic of the load calculation, the size of the file should be _n_nodes * _s, and this should be a size_t. If its value would exceed the maximum positive value of off_t, then the index can't be read back in (at least by the machine that wrote it).
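
Something along these lines, perhaps (a sketch only, with made-up names rather than Annoy's actual internals):

#include <cstddef>
#include <cstdio>
#include <limits>
#include <sys/types.h> // off_t

// hypothetical helper: warn if an index of n_nodes nodes of s bytes each
// would produce a file too large for off_t to represent on this platform
static bool fits_in_off_t(std::size_t n_nodes, std::size_t s) {
    std::size_t total = n_nodes * s; // the file size, held in a size_t as above
    if (total > (std::size_t)std::numeric_limits<off_t>::max()) {
        std::fprintf(stderr,
                     "saving %zu bytes, but sizeof(off_t) is %zu; "
                     "this index cannot be loaded back on this machine\n",
                     total, sizeof(off_t));
        return false;
    }
    return true;
}

save could call something like this with _n_nodes and _s before writing, and emit the warning (or refuse to write) when it returns false.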

Apologies if this was obvious to everyone else reading this.

@jlmelville (Author)

Closing, because this was the same problem as #388 (which was fixed by #442).
