Misleading time complexity #5
Comments
This is incredible analysis -- really appreciate you doing that and posting it here. It's been a while now, but I had jumped to the O(log N) conclusion based either on something I'd read, or just simplistically thinking "oh, it's a tree, searching must be log N". But of course as you show that's incorrect. It worked well for me because my […]. I'll update the README and my related article to correct this, and link to your analysis above. Thanks again!
Made some updates. Thanks again!
Cool, thank you for the fast reply and edit :). Now I only need to find some other metric tree which works sufficiently well for my huge dataset. Maybe I'll have to look for something written in C/C++, or a Python module wrapping such a C/C++ library.
Can you do a linear search (in C)? One million is not such a huge dataset. How fast do you need the query to be?
I didn't try a linear search in C yet; I was mostly working in Python, and until now it worked sufficiently fast, even calculating the dHashes for all the images. Basically, I'd want to dedupe 100M images. I guess if it runs in under one month it would be fine by me, but if it runs in 5 days or so, that would be even better. I have a 384-bit dHash (width=16, height=12, both vertical and horizontal gradient => 16×12×2 = 384 bits).
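(For scale, a brute-force scan over integer hashes is easy to sketch in pure Python. This is an illustrative snippet with made-up example hashes, not code from this thread; a real 100M-image run would want C, NumPy bit tricks, or multiprocessing.)

```python
def hamming(a: int, b: int) -> int:
    # Hamming distance between two equally sized integer hashes.
    return bin(a ^ b).count("1")

def find_within(query: int, hashes, max_distance: int):
    # Plain linear scan over all stored hashes.
    matches = []
    for h in hashes:
        d = hamming(query, h)
        if d <= max_distance:
            matches.append((d, h))
    return matches

# Made-up 64-bit example hashes.
hashes = [0x0123456789ABCDEF, 0xFEDCBA9876543210, 0x0123456789ABCDEE]
print(find_within(0x0123456789ABCDEF, hashes, 2))   # [(0, ...), (1, ...)]
```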
I think the time complexity is misleading.

To check N images against each other, a lookup complexity of `N log(N)` is advertised. However, this is basically never the case. It would only be true if we descended into only one child for each node. That can actually happen when we call `BKTree.find(item, 0)`, i.e., when we are only interested in items with distance 0. But as soon as you are descending into multiple children on average, you are not `O(log N)` anymore. These websites also wrongly list `O(log N)` for the lookup complexity. This website explicitly states that `O(log N)` is not the case. The `O(log N)` claim can easily be proven wrong with benchmarks and also theoretically.
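For context, a minimal usage sketch of such lookups with pybktree, assuming the `BKTree(distance_func, items)` / `tree.find(item, n)` API and the `hamming_distance` helper from its README (the example values are mine):

```python
import pybktree

# Small integer "hashes", values picked purely for illustration.
tree = pybktree.BKTree(pybktree.hamming_distance, [0, 4, 5, 14, 15])

# Exact lookup (distance 0): only one child per node needs to be followed.
print(tree.find(13, 0))   # -> [] (13 itself is not in the tree)

# Ranged lookup: several children per node can match, so more of the tree is visited.
print(tree.find(13, 2))   # -> [(1, 5), (1, 15), (2, 4), (2, 14)]
```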
For the time complexity of the ranged query `find all <= d`, consider a tree with `N` elements and a distance measure with values in `[0, D)`. E.g., for 32-bit integers `D=32`, because the Hamming distance can range in `[0,31]`. Assume further that the tree is pretty well-balanced and saturated, i.e., each node has `D=32` children. For that ranged query, we calculate the distance to the current node and then have to check all children with distances in `[dist_to_current - d, dist_to_current + d]`. So, as a rough estimate, we need to check `min(2d+1, D)` children on each level.
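A minimal sketch of that traversal (my own illustrative recursion, not pybktree's actual implementation; the `(value, {edge_distance: child})` node layout is an assumption):

```python
def range_search(node, query, d, distance, results):
    """Collect (distance, value) for all items within distance d of query."""
    value, children = node          # node assumed to be (value, {edge_distance: child})
    dist = distance(value, query)
    if dist <= d:
        results.append((dist, value))
    # By the triangle inequality, only children whose edge distance lies in
    # [dist - d, dist + d] can contain matches -- i.e. up to min(2d+1, D)
    # children per level, which is exactly where the power law comes from.
    for edge_distance, child in children.items():
        if dist - d <= edge_distance <= dist + d:
            range_search(child, query, d, distance, results)
    return results
```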
If we have to check `m` children on each level, then in total we will have to check `m^depth` children! The number of elements in such a saturated tree would be `N = D^depth`, so the depth would be `depth = ln(N)/ln(D)`. Combining this, we get `m^[ln(N)/ln(D)] = N^[ln(m)/ln(D)]` lookups. So, this is clearly a power law with a fractional exponent, not a logarithmic scaling!
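A quick numerical check of that identity, with `m`, `D`, and `depth` picked arbitrarily for illustration:

```python
import math

m, D, depth = 9, 32, 5                   # arbitrary example values
N = D ** depth                           # elements in a saturated tree of that depth
visited = m ** depth                     # children checked when visiting m per level
print(visited)                           # 59049
print(N ** (math.log(m) / math.log(D)))  # ≈ 59049, i.e. N^(ln(m)/ln(D))
```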
For a lookup to `find all <= d`, we have `m = min(2d+1, D)`, resulting in a lookup complexity of `N^[ln(min(2d+1,D))/ln(D)]`. For `D=32` (32-bit integers), we would get these scalings:

- `d=0`: `O(log(N))`
- `d=1`: `O(N^0.32)`
- `d=2`: `O(N^0.46)`
- `d=4`: `O(N^0.63)`
- `d=8`: `O(N^0.82)`
- `d=16`: `O(N)`
or for `D=64`:

- `d=0`: `O(log(N))`
- `d=1`: `O(N^0.26)`
- `d=2`: `O(N^0.39)`
- `d=4`: `O(N^0.53)`
- `d=8`: `O(N^0.68)`
- `d=16`: `O(N^0.84)`
- `d=32`: `O(N)`
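These exponents follow directly from `ln(min(2d+1, D)) / ln(D)`; a short check of my own (not part of the issue's attachments) reproduces both lists:

```python
import math

def lookup_exponent(d: int, D: int) -> float:
    # Exponent of the power law N^(ln(min(2d+1, D)) / ln(D)).
    m = min(2 * d + 1, D)
    return math.log(m) / math.log(D)

for D in (32, 64):
    print(f"D={D}:", {d: round(lookup_exponent(d, D), 2) for d in (1, 2, 4, 8, 16, 32)})
# D=32: {1: 0.32, 2: 0.46, 4: 0.63, 8: 0.82, 16: 1.0, 32: 1.0}
# D=64: {1: 0.26, 2: 0.39, 4: 0.53, 8: 0.68, 16: 0.84, 32: 1.0}
```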
For `d >= D/2`, we would get `O(N)` scaling, because with the given assumptions we would have to look at each element in the tree on average. The assumption, which is encoded in `m = min(2d+1, D)`, is that the distance to each node is exactly `D/2`, which on average should be true for looking up random values.
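That `D/2` assumption is easy to sanity-check for random values with a tiny simulation (again my own sketch, not from the attached benchmark):

```python
import random

bits, pairs = 64, 10_000
total = sum(
    bin(random.getrandbits(bits) ^ random.getrandbits(bits)).count("1")
    for _ in range(pairs)
)
print(total / pairs)   # ≈ 32, i.e. D/2 for D=64
```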
How to check this? Well, let's benchmark: `benchmark-pybktree.py`
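The attached `benchmark-pybktree.py` is the authoritative script; a stripped-down sketch of such a benchmark might look as follows (tree sizes, query counts, and the timing loop are my simplifications, and pybktree's `BKTree`/`find`/`hamming_distance` API is assumed):

```python
import random
import time

import pybktree

def benchmark(tree_sizes=(10_000, 100_000, 1_000_000),
              distances=(0, 1, 2, 4, 8, 16), queries=100):
    for n in tree_sizes:
        values = [random.getrandbits(64) for _ in range(n)]
        tree = pybktree.BKTree(pybktree.hamming_distance, values)
        for d in distances:
            start = time.perf_counter()
            for _ in range(queries):
                tree.find(random.getrandbits(64), d)
            per_lookup = (time.perf_counter() - start) / queries
            print(f"N={n:>9} d={d:>2}: {per_lookup:.6f} s per lookup")

if __name__ == "__main__":
    benchmark()
```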
Here are the results:
The plot is in log-log scale. In log-log scale, all power laws of the form `f(x) = a*x^b` become linear functions of the form `log(f(x)) = log(a) + b*log(x)`. The dashed lines are fits to those linear functions, and here are the results for the fitted scaling laws:

- `0.97e-6s N^1.14`
- `d=0`: `1.80e-6s N^0.27`
- `d=1`: `1.81e-6s N^0.38`
- `d=2`: `1.51e-6s N^0.52`
- `d=4`: `1.21e-6s N^0.73`
- `d=8`: `1.00e-6s N^0.92`
- `d=16`: `0.98e-6s N^0.99`
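For reference, the log-log fit itself is just a linear least-squares fit; a minimal version (my own sketch with hypothetical numbers, assuming `numpy` is available) looks like:

```python
import numpy as np

def fit_power_law(sizes, times):
    # log(t) = log(a) + b*log(N)  ->  ordinary linear fit in log-log space.
    b, log_a = np.polyfit(np.log(sizes), np.log(times), 1)
    return np.exp(log_a), b            # t ≈ a * N**b

# Hypothetical measurements, only to show the call:
sizes = np.array([1e4, 1e5, 1e6])
times = np.array([1.2e-4, 3.8e-4, 1.2e-3])
a, b = fit_power_law(sizes, times)
print(f"t ≈ {a:.2e} s * N^{b:.2f}")
```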
For some reason, the exponents seem to be systematically underestimated. But it can clearly be seen that all except `d=0` follow power laws instead of logarithmic scaling. As can be seen in the code, the benchmark ran with 64 bits. So, assuming you have a 64-bit hash and you want to find all entries with a distance `<= 16`, you already enter `O(N)` territory and the BK-tree becomes useless.

Conclusion: This is less suited than I thought for a lookup over 1M+ files, and even less suited if your threshold distance is quite high, which it is in my case. I'm basically edging close to `O(N)` because of this.