This seems related to #368 or #348, but I can't solve the problem by specifying the metric before loading. The problem is that after saving, I am unable to reload the index successfully. The load method reports "found 0 roots with degree -1".
Unfortunately, I have not been able to reproduce this with a small dataset. I am using the small NORB dataset which is very high dimensional (> 10,000 columns). I realize that this is a poor use case with Annoy, although Annoy is actually very accurate with the parameters I use (it's just very slow). Assuming you can get hold of the small NORB dataset (see below), the error is triggered in quite a straightforward fashion, but only after building an index based on nearly 29,000 observations:
from annoy import AnnoyIndex

ndim = 18432
ann = AnnoyIndex(ndim, metric="euclidean")
for i in range(28852):
    ann.add_item(i, norb_mat[i, :])
ann.verbose(True)
ann.build(50)
# this works fine
print(ann.get_nns_by_item(0, 15, search_k=1500))
ann.save("norb29k.test")

ann2 = AnnoyIndex(ndim, metric="euclidean")
# this doesn't find any data
ann2.load("norb29k.test")
print(ann2.get_nns_by_item(0, 15, search_k=1500))
To get hold of the NORB dataset, the easiest way in Python is to:
The problem manifests after adding item 28852. Before that, this code saves and loads without problem. There doesn't seem to be anything special about that vector; if I skip it and use 28853 instead, the problem still manifests.
The problem also manifests only with n_trees=50; below that value, it works fine. But it's not a straightforward memory issue either, because I can build and query the index for the entire small NORB dataset (48,600 items) with n_trees=50, as long as I don't attempt to save it to disk and reload it.
I am using Windows 10 with Python 3.7, but I originally saw the problem with RcppAnnoy, which binds to the C++ code directly, so it's not a Python problem. I apologize for not providing a more easily reproducible example.
I have dug a bit into the C++ and I have found the problem: at least on my machine, Annoy cannot read files larger than 2GB, because sizeof(off_t) is 4 bytes. For small NORB, I was creating files of around 3GB in size. When loading, n_nodes is calculated as:
_n_nodes = (S)(size / _s);
but size has already overflowed its limits, I think.
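To make the overflow concrete, here is a minimal standalone sketch, not Annoy code: it shows what happens when a ~3 GB byte count (roughly the size of my small NORB index file) is squeezed through a signed 32-bit type, which is what a 4-byte off_t is on my machine.

#include <cstdint>
#include <cstdio>

int main() {
    long long true_size = 3LL * 1024 * 1024 * 1024;  // ~3 GB, like my index file
    int32_t narrowed = (int32_t)true_size;           // what a 4-byte off_t ends up holding
    // With the usual two's-complement wrap-around this prints -1073741824,
    // so _n_nodes = (S)(size / _s) is computed from garbage and load finds no roots.
    printf("on disk: %lld bytes, through a 32-bit type: %d\n",
           true_size, (int)narrowed);
    return 0;
}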
Would it be possible to warn about this in save? Reversing the arithmetic of the load calculation, the size of the file should be _n_nodes * _s, and this should be a size_t. If its value would exceed the maximum positive value of off_t, then the index can't be read back in (at least by the machine that wrote it).
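For concreteness, here is a rough sketch of the kind of check I have in mind; index_fits_off_t is a hypothetical name, not anything in Annoy, and only _n_nodes and _s correspond to fields in the actual source.

#include <cstdio>
#include <limits>
#include <sys/types.h>  // off_t

// Hypothetical helper for save(): would the resulting file be addressable
// by this platform's off_t? The product is computed in unsigned long long
// so that the check itself cannot overflow a 32-bit size_t.
static bool index_fits_off_t(unsigned long long n_nodes, unsigned long long s) {
    unsigned long long expected_bytes = n_nodes * s;  // file size save() will produce
    if (expected_bytes > (unsigned long long)std::numeric_limits<off_t>::max()) {
        fprintf(stderr,
                "index needs %llu bytes, but off_t here is %d bytes wide; "
                "the saved file could not be loaded back on this machine\n",
                expected_bytes, (int)sizeof(off_t));
        return false;
    }
    return true;
}

On my Windows build, something like this would have caught the ~3 GB small NORB index at save time instead of failing silently at load.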
Apologies if this was obvious to everyone else reading this.