Introduce a way to represent constrained statistics / bounds on values in Statistics #8078
Comments
A further wrinkle that may be worth thinking about, even if the solution is not to do anything, is that some formats, including parquet, special-case certain floating point values. There is some discussion on apache/parquet-format#196 about trying to make the statistics actually usable, and one potential outcome might be to just use totalOrder, so it may even be that no action is required; this is just an FYI.
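For illustration, Rust's standard library already exposes IEEE 754 totalOrder as `f64::total_cmp`. Here is a sketch (not tied to any parquet implementation) of min/max statistics that remain well defined even for `-0.0` and NaN:

```rust
// Sketch: computing min/max statistics for a float column using IEEE 754
// totalOrder (f64::total_cmp) instead of partial_cmp, which is undefined
// for NaN and treats -0.0 == +0.0.
fn float_min_max(values: &[f64]) -> Option<(f64, f64)> {
    let mut iter = values.iter().copied();
    let first = iter.next()?;
    let (mut min, mut max) = (first, first);
    for v in iter {
        if v.total_cmp(&min).is_lt() {
            min = v;
        }
        if v.total_cmp(&max).is_gt() {
            max = v;
        }
    }
    Some((min, max))
}

fn main() {
    // Under totalOrder, -0.0 sorts before +0.0 and NaN sorts above all numbers.
    let (min, max) = float_min_max(&[0.0, -0.0, 1.5, f64::NAN]).unwrap();
    println!("min = {min:?}, max = {max:?}"); // min = -0.0, max = NaN
}
```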
I have two proposals. If we have cases where the lower bound of a range might be inexact (or perhaps absent) while the upper bound is exact, or vice versa, we could define the following: …
@berkaysynnada under what circumstances would we use `Inexact`?
Actually, upon more thought, I think the best next step here is for me to prototype what this would look like. I'll do that and report back.
You are right, I cannot find any case to use `Inexact`. Any exact value can be relaxed toward the positive or negative side using intervals.
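A minimal standalone sketch of that point (the `Interval` type here is a hypothetical stand-in, not DataFusion's): relaxing an exact value just means widening one side of a degenerate interval to unbounded.

```rust
/// Hypothetical stand-in interval type, not DataFusion's `Interval`.
#[derive(Debug, Clone, PartialEq)]
struct Interval<T> {
    lower: Option<T>, // None = unbounded below
    upper: Option<T>, // None = unbounded above
}

/// Relax an exact value v to [v, +inf): the true value is at least v.
fn relax_above<T>(v: T) -> Interval<T> {
    Interval { lower: Some(v), upper: None }
}

/// Relax an exact value v to (-inf, v]: the true value is at most v.
fn relax_below<T>(v: T) -> Interval<T> {
    Interval { lower: None, upper: Some(v) }
}

fn main() {
    assert_eq!(relax_above(42).lower, Some(42));
    assert_eq!(relax_below(42).upper, Some(42));
}
```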
In proposal two, you still need … On the contrary: if you have … BTW, @berkaysynnada's first proposal is the more general one: it allows range estimations where either bound can be an exact or an inexact estimate. His second proposal is slightly simpler to implement (and leverages the interval library), but it constrains range estimations to exact bounds. When we decide to support slightly more complicated filters, say …
I think there is a mismatch between the current use of … Specifically, … For example, the comments on … However, it is used to skip processing files when the (inexact) row count is above the fetch/limit. I am pretty sure this is only valid if the estimate of the number of rows is conservative (i.e., the real value is at least as large as the statistics), but the code uses the value even for … Another example is …
Another challenge I found with using … I will try prototyping option 1 suggested by @berkaysynnada in #8078 (comment) and see what I find.
I think the documentation is correct, but you have found 2 bugs in the code: …

I will open a PR now fixing these issues.
Just so you know, I have completed the interval refactor and submitted it for our internal review. After that, interval arithmetic will be in …
Any thoughts on @berkaysynnada's proposal, i.e. …
I think extending the … BTW, this exercise revealed that the name …
I really like it. My plan was to spend time trying to code up some of these proposals and see if they can fit in the existing framework (and thus potentially reveal in what cases Inexact would be used), etc. However, I spent much more of my time this week on apache/arrow-rs#5050 and related discussions around #7994 and #8017 than I had planned, so I am sadly behind. I hope to make progress tomorrow.
It is unlikely I will have time to work on this today -- I will pick it up next week.
I have been playing with and studying this code. While the suggestion from @ozankabak and @berkaysynnada in #8078 (comment) is very general and can represent many types of uncertainty in statistics, I haven't found cases yet where that full generality is important. For example, I can't find (nor think of) an important case where the lower bound would be known with certainty while the upper bound was uncertain (vs. TYPE::MAX). Another example would be a use case where distinguishing between ranges like …
Thus I am going to prototype what adding … I also plan to encapsulate more of the checks into …
I'll report back here with how it goes shortly.
Do we want to support something like this in the short term? Then we would need inexact bounds on both sides.
Ok. Thank you for this point. I am still working through exactly what "inexact" means / when it would be used; I probably misunderstand. I plan to work on this ticket in several PRs. In the first few, I plan to centralize the use of …
After some thought, I think the key information we are trying to capture in statistics is: …
I don't think the proposal (at least as I understand it) in #8078 (comment) can capture the known min/max values. However, perhaps something like this could:

```rust
#[derive(Debug, Clone, PartialEq, Eq, Default)]
pub enum Precision<T: Debug + Clone + PartialEq + Eq + PartialOrd> {
    /// The exact value is known
    Exact(T),
    /// The exact value is not known, but the real value is **known** to be within
    /// the specified range: `lower <= value <= upper` and furthermore is estimated
    /// to be `estimate`. However, the real value may fall anywhere in the range.
    /// TODO: we could use `Interval` here instead, which could represent more
    /// complex cases (like open/closed bounds rather than using T::MIN/T::MAX)
    Bounded { lower: T, estimate: T, upper: T },
    /// Nothing is known about the value
    #[default]
    Absent,
}
```

Maybe `estimate` should be … Example 1: Filter …
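To make that truncated filter example concrete, here is one way (a sketch; the 20% selectivity guess is purely illustrative) the `Bounded` variant above could describe a filter's output row count:

```rust
// Sketch: how the `Bounded` variant above could describe the row count
// after a filter. The 20% selectivity guess is an illustrative assumption.
fn filter_row_count(input_rows: usize) -> Precision<usize> {
    Precision::Bounded {
        lower: 0,                 // a filter may remove every row...
        estimate: input_rows / 5, // ...we guess 20% selectivity...
        upper: input_rows,        // ...but a filter can never add rows
    }
}
```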
Thank you for investigating this. There are some parts I don't quite follow. Particularly, …
I don't think this is entirely accurate. In addition to the two configurations you identified, it can also be:

```rust
Precision::Range(
    PointEstimate::Exact(0), // min
    PointEstimate::Exact(N), // max
)
```

which captures the known min/max values. However, this loses our inexact guess. Example 2 …
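For readers following along, a self-contained sketch of this two-level shape (names mirror the snippet above and @berkaysynnada's first proposal; this is not a committed DataFusion API):

```rust
// Sketch of "proposal one": a range whose two bounds each carry their own
// exactness. Names are assumed from the snippet above.
#[derive(Debug, Clone, PartialEq)]
enum PointEstimate<T> {
    Exact(T),
    Inexact(T),
}

#[derive(Debug, Clone, PartialEq)]
enum Precision<T> {
    /// (min, max), where each bound may be exact or inexact
    Range(PointEstimate<T>, PointEstimate<T>),
    /// Nothing is known about the value
    Absent,
}

fn main() {
    // Known min/max of 0..=1000 rows, with no point estimate in between:
    let rows: Precision<u64> =
        Precision::Range(PointEstimate::Exact(0), PointEstimate::Exact(1000));
    // An inexact upper bound, and a fully unknown value:
    let guess = Precision::Range(PointEstimate::Exact(0), PointEstimate::Inexact(200));
    let unknown: Precision<u64> = Precision::Absent;
    println!("{rows:?} {guess:?} {unknown:?}");
}
```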
I like your proposal for the new type (reproduced below).

```rust
#[derive(Debug, Clone, PartialEq, Eq, Default)]
pub enum Precision<T: Debug + Clone + PartialEq + Eq + PartialOrd> {
    /// The exact value is known
    Exact(T),
    /// The exact value is not known, but the real value is **known** to be within
    /// the specified range: `lower <= value <= upper` and furthermore is estimated
    /// to be `estimate`. However, the real value may fall anywhere in the range.
    /// TODO: we could use `Interval` here instead, which could represent more
    /// complex cases (like open/closed bounds rather than using T::MIN/T::MAX)
    Bounded { lower: T, estimate: T, upper: T },
    /// Nothing is known about the value
    #[default]
    Absent,
}
```

A few questions and remarks: …
Assuming these tweaks, the type would look like:

```rust
#[derive(Debug, Clone, PartialEq, Eq, Default)]
pub enum Estimate<T: Debug + Clone + PartialEq + Eq + PartialOrd> {
    Range { lower: T, value: T, upper: T },
    /// Nothing is known about the value
    #[default]
    Absent,
}
```

or, with `ScalarValue`:

```rust
#[derive(Debug, Clone, PartialEq, Default)]
pub enum Estimate {
    Range { lower: ScalarValue, value: ScalarValue, upper: ScalarValue },
    /// Nothing is known about the value
    #[default]
    Absent,
}
```

Once `Interval` is available, this could become:

```rust
#[derive(Debug, Clone, PartialEq, Default)]
pub enum Estimate {
    Range { bounds: Interval, value: ScalarValue },
    /// Nothing is known about the value
    #[default]
    Absent,
}
```
- Yes, this is what I was trying to say. Thank you.
- Yes, I think so (though I am not 100% sure about the 62.5 probability -- I need to double-check my math).
- That is a good point.
- I agree this is probably a useful special case.
- Yes, the more I play with the code, the more I think using `Interval` is important to avoid reimplementing the same thing over and over again.
- I agree the term …
FWIW what I am doing now is studying how the existing … I will think on this more tonight and continue the work tomorrow.
Update here: I filed a larger Epic, #8227, to track all the work that seems to be going on with statistics, as it was getting pretty overwhelming to keep it all straight. My current conclusion is that the current state of the … Thus, I plan to work on #8229 and #8228 first and then come back here.
This has come up again recently -- I started a discussion here: … Suremarc has a PR here: …
Now that the interval library is in …, we can have:

```rust
#[derive(Debug, Clone, PartialEq, Default)]
pub enum Estimate {
    Range { bounds: Interval, value: ScalarValue },
    /// Nothing is known about the value
    #[default]
    Absent,
}
```

This would be great to have and would support all use cases as far as I can tell.
Is your feature request related to a problem or challenge?
This has come up a few times, most recently in discussions with @berkaysynnada on apache/arrow-rs#5037 (comment)
Use case 1 is that for large binary/string columns, formats like parquet allow storing a truncated value that does not actually appear in the data. Given that values are stored in the min/max metadata, storing truncated values keeps the size of the metadata down.
For example, for a string column that has very long values, it requires much less space to store a short value slightly lower than the actual minimum as the "minimum" statistics value, and one that is slightly higher than the actual maximum as the "maximum" statistics value.
For example:

|     | actual value in the data | stored statistic |
|-----|--------------------------|------------------|
| min | `aaa......z`             | `a`              |
| max | `qqq......q`             | `r`              |
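A sketch of the kind of truncation shown above (hand-rolled for illustration; real parquet writers handle UTF-8 boundaries and overflow more carefully): any prefix works as a stored minimum, while the stored maximum must be bumped after truncation so it still upper-bounds the data.

```rust
// Sketch of truncated min/max statistics for strings (illustrative only).
fn truncate_min(actual_min: &str, len: usize) -> String {
    // Any prefix of the true minimum is <= the true minimum.
    actual_min.chars().take(len).collect()
}

fn truncate_max(actual_max: &str, len: usize) -> Option<String> {
    // Truncate, then bump the last char so the result is > the true maximum.
    let mut s: Vec<char> = actual_max.chars().take(len).collect();
    while let Some(c) = s.pop() {
        if let Some(bumped) = char::from_u32(c as u32 + 1) {
            s.push(bumped);
            return Some(s.into_iter().collect());
        }
        // Could not increment this char; drop it and try the previous one.
    }
    None // every char was at the maximum; no short upper bound exists
}

fn main() {
    assert_eq!(truncate_min("aaa......z", 1), "a");
    assert_eq!(truncate_max("qqq......q", 1).as_deref(), Some("r"));
}
```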
There is a similar use case when applying a Filter, as described by @korowa on #5646 (comment), and we have a similar one in IOx where the operator may remove values but won't decrease the minimum value or increase the maximum value in any column.
Currently `Precision` only represents `Exact` and `Inexact`; there is no way to represent "inexact, but bounded above/below".

Describe the solution you'd like
Per @berkaysynnada, I propose changing `Precision::Inexact` to a new variant `Precision::Between`, which would store an `Interval` of the known min/max of the value. This is a quite general formulation, and it can describe "how" inexact the values are. This would have the benefit of being very expressive (`Interval`s can represent open/closed bounds, etc.).
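A rough sketch of the proposed shape (using a simplified closed-interval stand-in rather than DataFusion's actual `Interval`, which also supports open bounds):

```rust
use std::fmt::Debug;

// Simplified closed-interval stand-in; DataFusion's Interval also supports
// open bounds, which is part of what makes this formulation expressive.
#[derive(Debug, Clone, PartialEq)]
struct Interval<T> {
    lower: T,
    upper: T,
}

/// Sketch of the proposed change: replace `Inexact(T)` with a variant that
/// records what is actually known, namely bounds on the true value.
#[derive(Debug, Clone, PartialEq)]
enum Precision<T: Debug + Clone + PartialEq> {
    /// The exact value is known.
    Exact(T),
    /// The true value is unknown but lies within `lower..=upper`.
    Between(Interval<T>),
    /// Nothing is known about the value.
    Absent,
}

fn main() {
    // The truncated-statistics use case: the column's values are somewhere in
    // ["a", "r"], even though neither bound appears in the data.
    let col_range = Precision::Between(Interval {
        lower: "a".to_string(),
        upper: "r".to_string(),
    });
    println!("{col_range:?}");
}
```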
Describe alternatives you've considered
There is also the possibility of introducing a simpler but more limited version of these statistics, like: …
Additional context
No response