-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[NSP-12] New meanings for comparison and aggregate operators #179
Conversation
ndc-models/src/lib.rs
Outdated
pub enum AggregateFunctionDefinition { | ||
Min, | ||
Max, | ||
Sum { | ||
/// The scalar type of the result of this function, which should have | ||
/// one of the type representations Int64 or Float64, depending on | ||
/// whether this function is defined on a scalar type with an integer or | ||
/// floating-point representation, respectively. | ||
result_type: ScalarTypeName, | ||
}, | ||
Custom { | ||
/// The scalar or object type of the result of this function | ||
result_type: Type, | ||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No Average
function? That was mentioned in the RFC.
Max, | ||
Sum { | ||
/// The scalar type of the result of this function, which should have | ||
/// one of the type representations Int64 or Float64, depending on |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't always the case (Int64 or Float64). For example, the postgres connector's Float4 (ie 32-bit float) sum function returns Float4. However, all integer types do appear to sum to a 64-bit int.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, but a connector can always upcast a smaller int or float type to a 64-bit type, and since we can't control how many rows will be summed, we should pick the largest possible representations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose I'll add that MongoDB has a Float128 type which I would want to use with inputs of that type
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, but a connector can always upcast a smaller int or float type to a 64-bit type, and since we can't control how many rows will be summed, we should pick the largest possible representations.
Makes sense. Here is how `sum is defined in substrait.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This means that, in something like Postgres, we'd have to cast up any Float4 column to Float8 before aggregation in order to meet this spec. This could be seen as not respecting the choice of the end-user when they chose Float4 precision for their column. Postgres seems to deliberately retain that precision during aggregation (which it doesn't do for integers, only floats, from what I can see in our PG connector schema), probably because you can sum up pretty far using floats, you just lose precision, unlike integers where you overflow.
Maybe overriding the chosen precision is okay, but it's something to consider.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For 128-bit numeric types, unfortunately those wouldn't be supported by datafusion right now. We could still support custom aggregate operators for those types, but just not the standardized sum operator.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me!
Max, | ||
Sum { | ||
/// The scalar type of the result of this function, which should have | ||
/// one of the type representations Int64 or Float64, depending on |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose I'll add that MongoDB has a Float128 type which I would want to use with inputs of that type
Max, | ||
Sum { | ||
/// The scalar type of the result of this function, which should have | ||
/// one of the type representations Int64 or Float64, depending on |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, but a connector can always upcast a smaller int or float type to a 64-bit type, and since we can't control how many rows will be summed, we should pick the largest possible representations.
Makes sense. Here is how `sum is defined in substrait.
- `~`, `~*`, `!~`, `!~*` (regex-based matches, may be difficult to standardize across implementions, might consider "starts with", "contains" and "ends with" instead) | ||
- `@>`, `<@` (array operators) | ||
|
||
Other possible aggregate functions (from https://datafusion.apache.org/user-guide/sql/operators.html#comparison-operators): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Prior art on standardized aggregate functions: https://substrait.io/extensions/functions_arithmetic/#aggregate-functions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we add more aggregate functions to NDC, would that be a breaking change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we add more aggregate functions to NDC, would that be a breaking change?
It would be breaking for engine since it would need to handle more cases. But if engine is checking the NDC version to make sure it's compatible with its own max-supported-version (not sure if we do this), then it's not a concern.
OK, so I've updated this PR and it should be ready for final review and merge. cc @paf31
Personally I still don't really like that we're forcing upcasting of lower precision types to higher precision types on aggregation, but I'm happy to start stricter like that and see if it causes issues for connectors and performance and we can relax it later if necessary. |
details: serde_json::Value::Null, | ||
}), | ||
) | ||
if let Some(first_value) = values.iter().next() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function is quoted verbatim in the spec, so it's probably going to be a bit hard to follow now. You might want to break up the code a bit in the text.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I've reduced the example in the docs down to just the first part of the function. The rest are basically the same just slight variants based on the type. The function isn't actually that big, it's just the shitty rust formatter that loves adding as many newlines as it possibly can blowing out every piece of error handling code into a lot of lines.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, thanks!
reference/types.md
ndc-test
if appropriatespecification/capabilities.md
in the specificationndc-test