-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement conversion from ColumnStatistics to NullableInterval #10510
Conversation
dd9f55f
to
0d75e43
Compare
}) | ||
match self.data_type() { | ||
Some(datatype) => Ok(Self::Null { datatype }), | ||
None => Err(DataFusionError::NotImplemented( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can use not_impl_err! macro
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
0d75e43
to
9ff8cb4
Compare
@@ -1469,6 +1472,8 @@ pub enum NullableInterval { | |||
MaybeNull { values: Interval }, | |||
/// The value is definitely not null, and is within the specified range. | |||
NotNull { values: Interval }, | |||
/// Added to handle cases with insufficient statistics | |||
Unknown, | |||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MaybeNull
with an unbounded interval corresponds to this Unknown variant, doesn't it ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's an interesting idea. Do you mean we should use ScalarValue::Null in Interval if the value is Absent?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For me, Interval::MaybeNull{Null, Null} looks strange, honestly.
@jayzhan211 wdyt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using ScalarValue::Null for representing unbounded intervals is discouraged. Although it's possible to construct these, performing arithmetic on them is not feasible. Could you detect the datatype and instead create ScalarValue::Datatype(None)? This approach is consistent with how we handle unbounded cases throughout the codebase.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I saw is that sometimes ColumnStatistics { ..., min, max, ... } both are equal to Precision::Absent. In such cases, we can't determine the type for Interval.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I tried to play with the stats in CLI mode, I attempted to get them from the ExecutionPlan
. Here, there was a default absent value:
let plan = self |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I attempted to get them from the ExecutionPlan. Here, there was a default absent value:
I would expect that calling execution_plan.statistics()
would match the schema returned from execution_plan.schema()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will re-check tomorrow then
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Anywhere we have
ColumnStatistics
we should also have the schema (and thus theDataType
of all columns) -- maybe we just have to pass the schema along with the statistics?
Yes, I think that's the right approach for this issue. Can we open a ticket for this? It should be an easy fix, and ColumnStatistics
isn't used in many parts of the code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@alamb
I have conducted some additional tests on various queries and observed the following results:
CREATE TABLE data_table (
id INT,
value INT
);
INSERT INTO data_table (id, value) VALUES
(1, 100),
(2, 200),
(3, 300),
(4, 400),
(5, 500),
(6, 600),
(7, 700),
(8, 800),
(9, 900),
(10, 1000);
SELECT id, value FROM data_table WHERE value > 500;
Log:
Schema: id: Int32, value: Int32
Column 0: ColumnStatistics { null_count: Inexact(0), max_value: Exact(Int32(NULL)), min_value: Exact(Int32(NULL)), distinct_count: Absent }
Column 1: ColumnStatistics { null_count: Inexact(0), max_value: Inexact(Int32(NULL)), min_value: Inexact(Int32(501)), distinct_count: Absent }
SELECT AVG(value) AS average_value, SUM(value) AS total_value FROM data_table;
Log:
Schema: average_value: Float64, total_value: Int64
Column 0: ColumnStatistics { null_count: Absent, max_value: Absent, min_value: Absent, distinct_count: Absent }
Column 1: ColumnStatistics { null_count: Absent, max_value: Absent, min_value: Absent, distinct_count: Absent }
SELECT a.id AS id_a, a.value AS value_a, b.id AS id_b, b.value AS value_b
FROM data_table a
CROSS JOIN data_table b;
Log:
Schema: id_a: Int32, value_a: Int32, id_b: Int32, value_b: Int32
Column 0: ColumnStatistics { null_count: Exact(0), max_value: Absent, min_value: Absent, distinct_count: Absent }
Column 1: ColumnStatistics { null_count: Exact(0), max_value: Absent, min_value: Absent, distinct_count: Absent }
Column 2: ColumnStatistics { null_count: Exact(0), max_value: Absent, min_value: Absent, distinct_count: Absent }
Column 3: ColumnStatistics { null_count: Exact(0), max_value: Absent, min_value: Absent, distinct_count: Absent }
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @dmitrybugakov @jayzhan211 and @berkaysynnada
@@ -1469,6 +1472,8 @@ pub enum NullableInterval { | |||
MaybeNull { values: Interval }, | |||
/// The value is definitely not null, and is within the specified range. | |||
NotNull { values: Interval }, | |||
/// Added to handle cases with insufficient statistics |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// Added to handle cases with insufficient statistics | |
/// No information is known about this intervals values or nullness |
@@ -1469,6 +1472,8 @@ pub enum NullableInterval { | |||
MaybeNull { values: Interval }, | |||
/// The value is definitely not null, and is within the specified range. | |||
NotNull { values: Interval }, | |||
/// Added to handle cases with insufficient statistics | |||
Unknown, | |||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I saw is that sometimes ColumnStatistics { ..., min, max, ... } both are equal to Precision::Absent. In such cases, we can't determine the type for Interval.
Anywhere we have ColumnStatistics
we should also have the schema (and thus the DataType
of all columns) -- maybe we just have to pass the schema along with the statistics?
This change addresses part of apache#10456.
9ff8cb4
to
b745673
Compare
@alamb
Or should I close the PR and create a new issue instead? |
Which issue does this PR close?
This change addresses part of #10456.
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?