Performance benchmarking #7
There is no concrete plan right now, but here are some thoughts. My suggestion is that we use the https://github.com/johnmyleswhite/Benchmarks.jl package, which seems to be becoming the de facto benchmarking package that other people build their benchmarking infrastructure on top of.

The first step would just be to get some simple benchmarks going with a few different simple data distributions. In https://jakevdp.github.io/blog/2013/04/29/benchmarking-nearest-neighbor-searches-in-python/ the author tested a few types of trees with different data distributions. Getting something like that going (without any regression testing or advanced stuff) would be good as a first step. We could also see if we get similar results to those in the link. The author mentions that the top-down approach of creating rectangles (which is also used in this package) can lead to bad performance in high dimensions, so that is something worth looking into. I also know that the BallTree in this package is not as optimized as it could be (for example, it could use the reduced distance when the metric is of Minkowski type).

The second step could be to get some representative real-life data that exhibits the problems mentioned above and see if there are any unexpected performance drops.

The third step is to integrate it with BenchmarkTrackers.jl and get some regression testing going.
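To make the first step a bit more concrete, here is a rough sketch of what "a few simple data distributions" plus a timing loop could look like. It is written in plain Python (standard library only) purely for illustration and is not code from this package or from Benchmarks.jl. All names (`uniform_points`, `clustered_points`, `knn_brute`, `run_benchmark`) and the parameter defaults are hypothetical, and the brute-force search is only a stand-in baseline where a tree query would go. It also shows the reduced-distance idea for the Euclidean case: candidates are ranked by squared distance, and the square root is taken only for the k results.

```python
import math
import random
import time


def uniform_points(n, dim, rng):
    """n points drawn uniformly from the unit hypercube."""
    return [[rng.random() for _ in range(dim)] for _ in range(n)]


def clustered_points(n, dim, rng, n_clusters=5, spread=0.01):
    """n points scattered tightly around a few random centers (assumed shape)."""
    centers = uniform_points(n_clusters, dim, rng)
    return [[c + rng.gauss(0.0, spread) for c in rng.choice(centers)]
            for _ in range(n)]


def reduced_dist(a, b):
    """Squared Euclidean distance: same ordering as the true distance, no sqrt."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn_brute(points, query, k):
    """Baseline k-nearest-neighbor search; a tree query would replace this."""
    order = sorted(range(len(points)),
                   key=lambda i: reduced_dist(points[i], query))[:k]
    # Convert to the true distance only for the k points that are returned.
    return [(i, math.sqrt(reduced_dist(points[i], query))) for i in order]


def time_queries(points, queries, k):
    """Wall-clock seconds spent answering all queries."""
    start = time.perf_counter()
    for q in queries:
        knn_brute(points, q, k)
    return time.perf_counter() - start


def run_benchmark(n=500, dim=3, n_queries=20, k=3, seed=0):
    """Time the same search on each data distribution."""
    rng = random.Random(seed)
    datasets = {
        "uniform": uniform_points(n, dim, rng),
        "clustered": clustered_points(n, dim, rng),
    }
    results = {}
    for name, pts in datasets.items():
        queries = rng.sample(pts, n_queries)
        results[name] = time_queries(pts, queries, k)
    return results


if __name__ == "__main__":
    for name, seconds in run_benchmark().items():
        print(f"{name}: {seconds:.4f} s")
```

A brute-force scan costs the same regardless of how the data is distributed, so any difference between the two timings here is noise. The point of the harness is that swapping `knn_brute` for a tree query should make distribution-dependent behavior visible.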
The framework from #30 is now available, so just add benchmarks there as needed.
Great, thanks for putting that together. As you can probably tell, I haven't had time to spend on this for ages :-/ Hopefully I'll have time again in the future.
As discussed in JuliaGeometry/KDTrees.jl#20, nearest neighbor performance can depend rather strongly on the data distribution. For real data, there are various "interesting" cases which occur quite commonly, including at least:
Is there a plan or preference for where benchmark code should go, or what form it should take? (#3 is obviously relevant.)