Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inf points causes corruption with BallTree #78

Open
BioTurboNick opened this issue Nov 29, 2018 · 13 comments
Open

Inf points causes corruption with BallTree #78

BioTurboNick opened this issue Nov 29, 2018 · 13 comments
Labels

Comments

@BioTurboNick
Copy link
Contributor

To implement an exclusive nearest neighbor algorithm, I 'removed' points from my points array by setting their values to Inf and rebuilding the BallTree from the edited array. I did this to avoid copying the array each time and to preserve indexing.

However, the results output by knn are incorrect and unstable. Once the tree is created, it will always produce the same wrong output. However, regenerating the tree on the same data and trying knn again can produce different wrong output. The indexes are often some absurdly large integer such as 874275200 with an Inf distance. This is despite the fact that there are definitely points with non-infinite values in the tree that should be chosen as nearest neighbors.

The same dataset produces correct and consistent results when using BruteTree and KDTree.

I'll see if I can provide a reproduction. The issue didn't start to occur until ~half the points were Inf.

@BioTurboNick BioTurboNick changed the title Inf points causes corruption to BallTree Inf points causes corruption with BallTree Nov 29, 2018
@BioTurboNick
Copy link
Contributor Author

Hmm... in another case I observed the calculated nearest neighbor distance vary between runs even though the index was correct. Couldn't get it to reproduce reliably.

However, here's a reproduction of the above:

coords = [29882.5 25974.3 Inf  Inf 17821.8 Inf Inf Inf Inf Inf 16322.0; 9279.86 9286.35 Inf Inf 10320.4 Inf Inf Inf Inf Inf 11459.0; 0.0 0.0 Inf Inf 0.0 Inf Inf Inf Inf Inf 0.0]
point =  [17889.55, 2094.45, 0.0]
tree = BallTree(coords)
knn(tree, point, 1) # returns ([_______], [Inf])

tree = BruteTree(coords)
knn(tree, point, 1) # returns ([5], [8226.23])

tree = KDTree(coords)
knn(tree, point, 1) # returns ([5], [8226.23])

@KristofferC
Copy link
Owner

Seems like something is wrong for BallTree in the presence of Inf?

@BioTurboNick
Copy link
Contributor Author

Just for the record, I realized that the skip argument to the various methods can provide what I was trying to do with setting points to Inf.

@exaexa
Copy link

exaexa commented Feb 24, 2020

Hello!

I'm experiencing similar problems with BallTrees on perfectly finite data; it seems to happen also when some of the points from which the tree is constructed are either very close together, almost equal or equal.

Should I try to get the exact data that trigger the error? (it's pretty deep now)

Thanks

@KristofferC
Copy link
Owner

Having a small reproducer would make it easier to get to the bottom of the issue here.

@exaexa
Copy link

exaexa commented Feb 24, 2020

OK, will try ASAP (the tooling around is a bit complicated)

@nlw0
Copy link

nlw0 commented Oct 2, 2021

OK, will try ASAP (the tooling around is a bit complicated)

I was trying to reproduce the issue generating random data, but It seems fine so far. Can you at least say the number of points and dimensions you've had a problem with? I noticed before the size of the array can influence the likelihood to trigger the problem.

@exaexa
Copy link

exaexa commented Oct 2, 2021

Hi -- sorry, I kindof missed my reminder and now I see it's already a year :D The data was around 30 dimensions, roughly 1000s of points, not more. You might have luck generating the same issue with same or almost-same points, try making a balltree out of a random cloud where all values are divided by 10^9, 10^10, etc.., or so. I'll add this to the TODOs for the next week and will let you know.

@nlw0
Copy link

nlw0 commented Oct 2, 2021

I may have a MWE for this issue:

using NearestNeighbors

using LinearAlgebra: norm
using Random:randperm

Ndim = 1
Npt = 11

data = randn(Ndim, Npt)

pa = randperm(Npt)[1:2]
data[:,pa] .= Inf

display(data)

tree = BallTree(data)
@show point = randn(Ndim)
distances = mapslices(norm, data .- point, dims=1)
gtc = argmin(distances)
@show gt = gtc[2]
@show idx, dist = nn(tree, point)

display([point  data[:,[gt, idx]]])

@assert idx == gt

Example output:

julia> include("/home/user/src/nntest.jl");
1×11 Matrix{Float64}:
 -0.084283  -0.489114  -0.808527  0.245605  -0.0784464  -2.12956  Inf  -0.395751  Inf  -0.24521  -0.549262
point = randn(Ndim) = [-1.1180802696153354]
gt = gtc[2] = 3
(idx, dist) = nn(tree, point) = (6, Inf)
1×3 Matrix{Float64}:
 -1.11808  -0.808527  -2.12956
ERROR: LoadError: AssertionError: idx == gt

A number or dimensions as low as 1 works. The number of points is a little more finicky, but with 11 it's pretty reliable to cause the problem. With 10 or 12 you don't see it. I believe with more points you need more "infs" to cause the issue.

Most of the times either the first point or the last gets picked, but you can get points within the valid range as well, as in the example above.

With a single "Inf" there's no issue, you need at least 2.

@exaexa
Copy link

exaexa commented Oct 2, 2021

Interesting, I guess my issue is a bit different then, I'm pretty sure I had no Infs there. Will check and let you know.

@nlw0
Copy link

nlw0 commented Oct 3, 2021

I've looked more into this, but I'm not very knowledgeable of ball trees. I think the issue with Infs is that the model cannot naturally deal with it. If you look into create_bsphere, what need to compute a center and a radius from a set of points, and what ends up happening is that Inf values will throw a center to Inf, but then you cannot compute a radius, so we get NaN. If you look into the tree.hyper_spheres you'll see things are messed up. What should a ball tree model look like with Inf data values? I'm not sure. So maybe it's more of a case of "not a bug"...

Not sure about @exaexa 's report, I couldn't find issues with regular numbers so far, so maybe it's some subtle float arithmetics thing?

@KristofferC
Copy link
Owner

Could we just be a bit careful so instead of getting a center at Inf with a NaN radius, you get a ball at 0.0 with an Inf radius. Perhaps things can be made to "just work"...

@nlw0
Copy link

nlw0 commented Oct 5, 2021

I'm not sure there's a way to make it work naturally... If you think about it, going for a neat solution, moving a single point to infinity should maybe cause the hypersphere to turn into a hyperplane touching the first inline point of the set in the normal direction. With multiple Inf, things start getting complicated, though. And this does not seem to go in line with the idea that making points Inf we are kind of "discarding" them. They may not be returned in the query, for sure, but this seems to be contradict very strongly the assumptions of a BallTree. I'm not sure what can be done, though, should we ignore points with Inf or just leave it up to the user?... We might be able to check when the mean is computed, but perhaps the best would be to perform a first pass looking for points to be discarded.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants