Performance benchmarking #7

c42f · 2015-11-10T12:09:37Z

As discussed in JuliaGeometry/KDTrees.jl#20, nearest neighbor performance can depend rather strongly on the data distribution. For real data, there's various "interesting" cases which occur quite commonly, including at least:

Query points can systematically be outliers relative to the data in the tree. This reduces the efficiency of early bailout during tree traversal.
Data in the tree can have outliers. This can effect the current tree building heuristic by forcing a lot of long skinny rectangles.
Data in the tree can reside on subspaces in the higher dimensional space. It would be interesting to see how this effects query efficiency relative to the same data in lower dimensions.

Is there a plan/preference for where benchmark code should go, or what form it should take? (#3 obviously relevant).

KristofferC · 2015-11-10T13:05:48Z

There is no real concrete plan right now but here are some thoughts.

My suggestion is we use the https://github.com/johnmyleswhite/Benchmarks.jl package which seems to start becoming the de facto benchmarking package that other people build their benchmarking infrastructure on top of.

The first step would just be to get some simple benchmarks going with a few different simple data distributions. In https://jakevdp.github.io/blog/2013/04/29/benchmarking-nearest-neighbor-searches-in-python/ the author tested a few types of trees with different data distributions. Getting something like that going (without any regression testing or advanced stuff) would be good as a first step.

We could also see if we get similar results as in the link. The author mentions that the top down approach of creating rectangles (which is also used in this package) can lead to bad performance in large dimensions so that is something worth looking into. I also know that the BallTree is not as optimized as it could be in this package (for example using reduced distance when the metric is of Minkowsky type).

Second step could be to get some representative real life data that exhibits the problems mentioned above and see if there are any unexpected performance drops.

Third step is to integrate it with BenchmarksTrackers.jl and get some regression testing going.

KristofferC · 2016-08-24T10:42:04Z

The framework from #30 is now available so just add benchmarks there as needed.

c42f · 2016-08-24T10:48:33Z

Great, thanks for putting that together. As you can probably tell I haven't had time to spend on this for ages :-/ Hopefully I'll have time again in the future.

c42f mentioned this issue Nov 10, 2015

Outliers and knn performance with full_rec_dim JuliaGeometry/KDTrees.jl#20

Closed

KristofferC closed this as completed Aug 24, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Performance benchmarking #7

Performance benchmarking #7

c42f commented Nov 10, 2015

KristofferC commented Nov 10, 2015

KristofferC commented Aug 24, 2016

c42f commented Aug 24, 2016

Performance benchmarking #7

Performance benchmarking #7

Comments

c42f commented Nov 10, 2015

KristofferC commented Nov 10, 2015

KristofferC commented Aug 24, 2016

c42f commented Aug 24, 2016