Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improved constraint on centroid sizes #40

Closed
oertl opened this issue Jan 28, 2015 · 9 comments
Closed

Improved constraint on centroid sizes #40

oertl opened this issue Jan 28, 2015 · 9 comments

Comments

@oertl
Copy link

oertl commented Jan 28, 2015

I propose an improved condition, whether neighboring centroids can be merged or not: Let {(x_1, w_1), (x_2, w_2),...,(x_N, w_N)} be the centroids in ascending order where x_i is the mean and w_i is the weight of the i-th centroid. Let W be the sum of all weights W := w_1 + w_2 +...+ w_N.

The original t-digest condition is equivalent to w_i/W*f(q_i) <= 4*delta, where f(q):=1/(q*(1-q)) and deltais the reciprocal value of the compression.

Instead, I propose to use the constraint int_{z_i}^{z_{i+1}} f(q) dq <= 4*delta with z_i:=(sum_{j<i} w_i)/W. Note, that this inequality is more strict than the original constraint due to Jensen's inequality and because f(q) is a convex function. The integral of f(q) from z_i to z_{i+1} can be solved analytically and the new constraint can be expressed as
ln(z_{i+1}/(1-z_{i+1})) - ln(z_{i}/(1-z_{i})) <= 4*delta or equivalently as (z_{i+1}-z_{i})<=(e^{4*delta}-1)*z_{i}*(1-z_{i+1}). The last inequality can be evaluated very efficiently, if the constant first factor on the right hand side is precalculated.

Since the integral of f(q) is indefinite at boundaries 0 and 1, the new constraint inhibits in a natural way merging with the first or the last centroid. This is not the case for the original constraint.

@tdunning
Copy link
Owner

tdunning commented Feb 1, 2015

I have been traveling too much lately to think well, so this might be the reason for my not understanding the virture of this alternative constraint.

Can you say in a sentence why this constraint is better? Also, how would pluggable constraints interact with this suggestion?

@oertl
Copy link
Author

oertl commented Feb 1, 2015

The proposed constraint is slightly more accurate near 0 and 1. The main difference between both constraints is the way the integral is calculated. The original constraint is obtained by approximating the integral using the rectangle rule. Therefore, and because f(q) is convex, the integral is slightly underestimated and the resulting inequality is less strict. For example, consider 10 centroids each with weight 1 and delta = 0.6. The original constraint would allow to merge the first two centroids while the proposed one would not.

To solve the integral for arbitrary scaling laws, which cannot be done using analytical means, an approriate numerical quadrature rule needs to be applied. As replacement of the rectangle rule, the trapezoidal rule could be used instead. The trapezoidal rule has the nice property that it overestimates in case of convex functions, which leads to a more strict constraint.

@oertl
Copy link
Author

oertl commented Feb 1, 2015

Using the new constraint and the q*(1-q) scaling law I have found an interesting relationship:

Assume the i-th centroid corresponds to a histogram bin ranging from z_ito z_{i+1). In case of the q*(1-q) scaling law and the new constraint we got ln(z_{i+1}/(1-z_{i+1})) - ln(z_{i}/(1-z_{i})) <= 4*delta. Hence, this inequality limits the bin width on the logit scale. Interestingly, the authors of the incremental quantile estimation algorithm (http://projecteuclid.org/download/pdfview_1/euclid.ss/1177334520) propose the same, to use equi-distant points on the logit scale for the approximation of the CDF.

@oertl
Copy link
Author

oertl commented Feb 1, 2015

For pluggable centroid scaling laws it could make sense to provide the antiderivative instead. The antiderivative describes the scale on which you want to limit the histogram bin sizes. For example, ln(z/(1-z)) corresponds to the q*(1-q) law, arcsin(2*z-1) corresponds to the sqrt(q*(1-q)) law. If the antiderivative is given as parameter, the t-digest algorithm does not need to be concerned with numerical integration.

@tdunning
Copy link
Owner

tdunning commented Feb 1, 2015

OK. I understand your point. But does it have practical impact?

Here, for instance is the result of applying the standard approach:

image

As you can see, close adherence. Magnifying:

image

Likewise, the cluster sizes don't exceed the limit. So if the limit is not exceeded, would this alternative make a big difference?

On the other hand, your integral formulation of the size limit and the resulting constant steps on logit space are intriguing relative to the size growth. Here are results of size growth for various size limits:

image

As you can see, there is qualitative difference between the standard q(1-q) rule and the sqrt(q(1-q)) rule. I don't fully understand this yet, but the approach I was trying to take was somewhat similar to your integral approach.

Here, btw, is the code that produced this graph. It uses a simplification of the t-digest that focuses only on a special case of ordered points where you know the number of points.

https://gist.github.com/tdunning/7a1d24ac3efb18138536

@oertl
Copy link
Author

oertl commented Feb 2, 2015

It would be interesting to investigate the cluster sizes over q also for the integral formulated constraint (especially in the magnified figure). There should be more centroids with size 1 near q=0 and q=1, allowing for more accurate quantile estimations in that range.

Anyway, I believe the integral formulation should be the fundamental constraint to start from. The rectangle rule, which leads to the original constraint, is a very good approximation for most values of q. However, for quantiles very close to 0 and 1 where the approximation error is largest better results could be obtained using a more accurate integration method.

The different centroid size growth over the number of inserted points can be explained as follows (using the integral formulation). For the q*(1-q) rule we have a couple of centroids of size 1 which violate the constraint. These centroids are the very first and the very last ones. All other centroids in between fulfill the constraint. Let's say there are X such violating centroids near 0 and due to symmetry also X such centroids near 1. If N is the total number of centroids we have N-2*Xcentroids fulfilling the constraint. Summing up the constraints int_{z_i}^{z_{i+1}} 1/(q*(1-q)) dq <= 4*delta for all these centroids (X+1 <= i <= N-X) gives int_{X/W}^{1-X/W}} f(q) dq <= 4*delta*(N-2*X) or 2*ln(W/X-1) <= 4*delta*(N-2*X). I am still unsure about the dependency of X on W. However, if it can be neglected, the number of centroids will increase at least logarithmically. This is also what the red graph suggests.

Contrary, the sqrt(q*(1-q)) rule allows the size of all centroids, including the first and the last ones, to scale linearly with the number of points. Consider for example the first centroid: The constraint for its size w_1 reads as int_{0}^{w_1/W} 1/sqrt(q*(1-q)) dq <= 4*delta or arcsin(2*w_1/W) + pi/2 <= 4*delta or 2*w_1/W <= sin(4*delta - pi/2). Hence, w_1 can get arbitrary large as W increases.
Consequently, for large W we can assume that the LHS of the constraint is almost equal to the RHS: int_{z_i}^{z_{i+1}} 1/sqrt(q*(1-q)) dq ~ 4*delta. Summing up the constraints for all centroids gives int_0^1 1/sqrt(q*(1-q)) dq ~ N*4*delta or pi ~ N*4*delta which explains that the number of centroids N approaches a constant. The key difference is that the integral of 1/(q*(1-q)) is indefinite and that of 1/sqrt(q*(1-q)) is finite.

The advantage of using a centroid size rule with finite integral is that the memory consumption will converge to a constant dependent only on the compression and not on the number of points. I think for most use cases this behavior is exactly what you expect from a quantile estimation algorithm. I do not know, if there are use cases which require to query quantiles which are more and more close to 0 and 1 while adding points, which would justify the increasing memory needs. Usually, the quantiles you are interested in are fixed and do not change with increasing number of inserted points.

@oertl
Copy link
Author

oertl commented Feb 4, 2015

Today I had time to analyze the asymptotic dependence of X on W. By definition the (X+1)-th centroid is the first centroid that fulfills the constraint: ln((X+1)/(W-X-1))-ln(X/(W-X)) < 4*delta or equivalently (X+1)*(W-X)/(X*(W-X-1)) < exp(4*delta)or W/(W*X-X*(X+1)) < exp(4*delta)-1. For W >> X the LHS converges to 1/X which gives 1/X < exp(4*delta)-1 and finally X > 1/(exp(4*delta)-1). Therefore, in case W >> X the smallest X fulfilling the inequality is approximately 1/(exp(4*delta)-1) which is approximately equal to 1/(4*delta) for delta<<1 and which is independent of W. As a consequence, the number of centroids really increases logarithmically with the number of points W for the q*(1-q) rule.

@tdunning
Copy link
Owner

tdunning commented Feb 5, 2015

Excellent. I had derived exactly the same result for how many 1 there
were, but I like the extension which shows that there is no dependence on
W. The other way to attack this is to assume linearity in q at the
periphery.

On Wed, Feb 4, 2015 at 12:33 PM, Otmar Ertl notifications@github.com
wrote:

Today I had time to analyze the asymptotic dependence of X on W. By
definition the (X+1)-th centroid is the first centroid that fulfills the
constraint: ln((X+1)/(W-X-1))-ln(X/(W-X)) < 4_delta or equivalently (X+1)(W-X)/(X(W-X-1))
< exp(4_delta)or W/(W_X-X_(X+1)) < exp(4_delta)-1. For W >> X the LHS
converges to 1/X which gives 1/X < exp(4_delta)-1 and finally X >
1/(exp(4_delta)-1). Therefore, in case W >> X the smallest X fulfilling
the inequality is approximately 1/(exp(4_delta)-1) which is approximately
equal to 1/(4_delta) for delta<<1 and which is independent of W. As a
consequence, the number of centroids really increases logarithmically with
the number of points W for the q_(1-q) rule.


Reply to this email directly or view it on GitHub
#40 (comment).

@tdunning
Copy link
Owner

I am closing this (quite late) because Otmar was completely right and this is now the basis for all supported implementations.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants