Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[NSP-12] New meanings for comparison and aggregate operators #179

Merged
merged 11 commits into from
Oct 30, 2024

Conversation

paf31
Copy link
Collaborator

@paf31 paf31 commented Sep 10, 2024

  • RFC
  • Specification updates
    • Changelog
    • Specification pages
    • Tutorial pages
    • reference/types.md
  • Implement your feature in the reference connector
  • Generate test cases in ndc-test if appropriate
  • Does your feature add a new capability?
    • Update specification/capabilities.md in the specification
    • Check the capability in ndc-test

@paf31 paf31 marked this pull request as ready for review September 11, 2024 18:53
@paf31 paf31 requested a review from codedmart as a code owner September 11, 2024 18:53
@paf31 paf31 changed the title [PACHA-41] New meanings for comparison and aggregate operators [NSP-12] New meanings for comparison and aggregate operators Sep 16, 2024
Comment on lines 339 to 352
pub enum AggregateFunctionDefinition {
Min,
Max,
Sum {
/// The scalar type of the result of this function, which should have
/// one of the type representations Int64 or Float64, depending on
/// whether this function is defined on a scalar type with an integer or
/// floating-point representation, respectively.
result_type: ScalarTypeName,
},
Custom {
/// The scalar or object type of the result of this function
result_type: Type,
},
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No Average function? That was mentioned in the RFC.

Max,
Sum {
/// The scalar type of the result of this function, which should have
/// one of the type representations Int64 or Float64, depending on
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't always the case (Int64 or Float64). For example, the postgres connector's Float4 (ie 32-bit float) sum function returns Float4. However, all integer types do appear to sum to a 64-bit int.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, but a connector can always upcast a smaller int or float type to a 64-bit type, and since we can't control how many rows will be summed, we should pick the largest possible representations.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose I'll add that MongoDB has a Float128 type which I would want to use with inputs of that type

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, but a connector can always upcast a smaller int or float type to a 64-bit type, and since we can't control how many rows will be summed, we should pick the largest possible representations.

Makes sense. Here is how `sum is defined in substrait.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This means that, in something like Postgres, we'd have to cast up any Float4 column to Float8 before aggregation in order to meet this spec. This could be seen as not respecting the choice of the end-user when they chose Float4 precision for their column. Postgres seems to deliberately retain that precision during aggregation (which it doesn't do for integers, only floats, from what I can see in our PG connector schema), probably because you can sum up pretty far using floats, you just lose precision, unlike integers where you overflow.

Maybe overriding the chosen precision is okay, but it's something to consider.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For 128-bit numeric types, unfortunately those wouldn't be supported by datafusion right now. We could still support custom aggregate operators for those types, but just not the standardized sum operator.

hallettj
hallettj previously approved these changes Sep 17, 2024
Copy link

@hallettj hallettj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!

Max,
Sum {
/// The scalar type of the result of this function, which should have
/// one of the type representations Int64 or Float64, depending on

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose I'll add that MongoDB has a Float128 type which I would want to use with inputs of that type

0x777
0x777 previously approved these changes Sep 17, 2024
Max,
Sum {
/// The scalar type of the result of this function, which should have
/// one of the type representations Int64 or Float64, depending on
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, but a connector can always upcast a smaller int or float type to a 64-bit type, and since we can't control how many rows will be summed, we should pick the largest possible representations.

Makes sense. Here is how `sum is defined in substrait.

- `~`, `~*`, `!~`, `!~*` (regex-based matches, may be difficult to standardize across implementions, might consider "starts with", "contains" and "ends with" instead)
- `@>`, `<@` (array operators)

Other possible aggregate functions (from https://datafusion.apache.org/user-guide/sql/operators.html#comparison-operators):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prior art on standardized aggregate functions: https://substrait.io/extensions/functions_arithmetic/#aggregate-functions.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we add more aggregate functions to NDC, would that be a breaking change?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we add more aggregate functions to NDC, would that be a breaking change?

It would be breaking for engine since it would need to handle more cases. But if engine is checking the NDC version to make sure it's compatible with its own max-supported-version (not sure if we do this), then it's not a concern.

@daniel-chambers daniel-chambers dismissed stale reviews from 0x777 and hallettj via 69ab675 October 21, 2024 01:16
@daniel-chambers
Copy link
Collaborator

OK, so I've updated this PR and it should be ready for final review and merge. cc @paf31

  • Added discussed average aggregate function to the spec
  • Updated reference implementation with new comparison operators and aggregate functions
  • Updated reference documentation with new aggregate function
  • Updated tutorial documentation
  • Updated changelog

Personally I still don't really like that we're forcing upcasting of lower precision types to higher precision types on aggregation, but I'm happy to start stricter like that and see if it causes issues for connectors and performance and we can relax it later if necessary.

details: serde_json::Value::Null,
}),
)
if let Some(first_value) = values.iter().next() {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is quoted verbatim in the spec, so it's probably going to be a bit hard to follow now. You might want to break up the code a bit in the text.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I've reduced the example in the docs down to just the first part of the function. The rest are basically the same just slight variants based on the type. The function isn't actually that big, it's just the shitty rust formatter that loves adding as many newlines as it possibly can blowing out every piece of error handling code into a lot of lines.

Copy link
Collaborator Author

@paf31 paf31 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thanks!

@daniel-chambers daniel-chambers merged commit 21941c9 into main Oct 30, 2024
11 checks passed
@daniel-chambers daniel-chambers deleted the phil/pacha-41-vs-main branch October 30, 2024 00:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants