[NSP-12] New meanings for comparison and aggregate operators #179

paf31 · 2024-09-10T16:58:50Z

daniel-chambers · 2024-09-17T00:35:41Z

ndc-models/src/lib.rs

+pub enum AggregateFunctionDefinition {
+    Min,
+    Max,
+    Sum {
+        /// The scalar type of the result of this function, which should have
+        /// one of the type representations Int64 or Float64, depending on
+        /// whether this function is defined on a scalar type with an integer or
+        /// floating-point representation, respectively.
+        result_type: ScalarTypeName,
+    },
+    Custom {
+        /// The scalar or object type of the result of this function
+        result_type: Type,
+    },


No Average function? That was mentioned in the RFC.

daniel-chambers · 2024-09-17T00:39:28Z

ndc-models/src/lib.rs

+    Max,
+    Sum {
+        /// The scalar type of the result of this function, which should have
+        /// one of the type representations Int64 or Float64, depending on


This isn't always the case (Int64 or Float64). For example, the Postgres connector's Float4 (i.e. 32-bit float) sum function returns Float4. However, all integer types do appear to sum to a 64-bit int.

Right, but a connector can always upcast a smaller int or float type to a 64-bit type, and since we can't control how many rows will be summed, we should pick the largest possible representations.
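To illustrate the overflow concern behind this reply, here is a minimal Rust sketch. It is not code from the PR or from any connector; the function names `sum_i32_checked` and `sum_i32_as_i64` are hypothetical, and the sketch only assumes that a connector is free to widen each value before accumulating.

```rust
/// Hypothetical helper: sums 32-bit integers in their own width.
/// Returns `None` as soon as the running total overflows `i32`.
fn sum_i32_checked(values: &[i32]) -> Option<i32> {
    values.iter().try_fold(0i32, |acc, &v| acc.checked_add(v))
}

/// Hypothetical helper: upcasts every value to `i64` before summing,
/// which is the kind of widening the comment above suggests.
fn sum_i32_as_i64(values: &[i32]) -> i64 {
    values.iter().map(|&v| i64::from(v)).sum()
}
```

With the two rows `[i32::MAX, 1]`, the narrow sum reports an overflow while the widened sum is representable. A 64-bit accumulator can still overflow in principle, but only with far more rows or far larger values.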

I suppose I'll add that MongoDB has a Float128 type which I would want to use with inputs of that type

Right, but a connector can always upcast a smaller int or float type to a 64-bit type, and since we can't control how many rows will be summed, we should pick the largest possible representations.

Makes sense. Here is how `sum` is defined in Substrait.

This means that, in something like Postgres, we'd have to cast up any Float4 column to Float8 before aggregation in order to meet this spec. This could be seen as not respecting the choice of the end-user when they chose Float4 precision for their column. Postgres seems to deliberately retain that precision during aggregation (which it doesn't do for integers, only floats, from what I can see in our PG connector schema), probably because you can sum up pretty far using floats, you just lose precision, unlike integers where you overflow.

Maybe overriding the chosen precision is okay, but it's something to consider.

Unfortunately, 128-bit numeric types wouldn't be supported by DataFusion right now. We could still support custom aggregate operators for those types, just not the standardized sum operator.
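The float side of this trade-off can be sketched in Rust as well. This is an illustration only, not Postgres or connector behavior: `sum_f32_narrow` and `sum_f32_as_f64` are hypothetical names, and the sketch assumes plain left-to-right accumulation.

```rust
/// Hypothetical helper: accumulates in `f32`, keeping the column's precision.
/// It does not overflow here, but small increments can be rounded away.
fn sum_f32_narrow(values: &[f32]) -> f32 {
    values.iter().sum()
}

/// Hypothetical helper: upcasts every value to `f64` before accumulating.
fn sum_f32_as_f64(values: &[f32]) -> f64 {
    values.iter().map(|&v| f64::from(v)).sum()
}
```

For the rows `[16_777_216.0, 1.0, 1.0]`, the `f32` accumulator stays at 16,777,216 because 2^24 + 1 is not representable in 32-bit floats, whereas the `f64` accumulator reaches 16,777,218. The narrow sum degrades gracefully rather than overflowing, which matches the distinction drawn above between floats and integers.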

hallettj

Looks good to me!

hallettj · 2024-09-17T18:26:07Z

ndc-models/src/lib.rs

+    Max,
+    Sum {
+        /// The scalar type of the result of this function, which should have
+        /// one of the type representations Int64 or Float64, depending on


I suppose I'll add that MongoDB has a Float128 type which I would want to use with inputs of that type

0x777 · 2024-09-17T20:43:31Z

ndc-models/src/lib.rs

+    Max,
+    Sum {
+        /// The scalar type of the result of this function, which should have
+        /// one of the type representations Int64 or Float64, depending on


Right, but a connector can always upcast a smaller int or float type to a 64-bit type, and since we can't control how many rows will be summed, we should pick the largest possible representations.

Makes sense. Here is how `sum` is defined in Substrait.

0x777 · 2024-09-17T20:46:28Z

rfcs/0021-comparison-and-aggregate-meanings.md

+- `~`, `~*`, `!~`, `!~*` (regex-based matches, may be difficult to standardize across implementions, might consider "starts with", "contains" and "ends with" instead)
+- `@>`, `<@` (array operators)
+
+Other possible aggregate functions (from https://datafusion.apache.org/user-guide/sql/operators.html#comparison-operators):


Prior art on standardized aggregate functions: https://substrait.io/extensions/functions_arithmetic/#aggregate-functions.

If we add more aggregate functions to NDC, would that be a breaking change?

It would be breaking for the engine, since it would need to handle more cases. But if the engine checks the NDC version to make sure it's compatible with its own max-supported version (not sure if we do this), then it's not a concern.
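A rough Rust sketch of the two ideas in this reply follows. Everything in it is hypothetical: the `AggregateFunction` enum is a simplified stand-in rather than the `ndc-models` type, and `Version`, `parse_version`, and `is_supported` are invented for illustration, not taken from the engine.

```rust
/// Simplified stand-in for a set of standardized aggregate functions.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum AggregateFunction {
    Min,
    Max,
    Sum,
    Average,
}

/// The match is exhaustive, so adding a new variant to the enum stops this
/// from compiling until the new case is handled. That is the sense in which
/// a new aggregate function is "breaking" for a consumer such as the engine.
fn describe(function: AggregateFunction) -> &'static str {
    match function {
        AggregateFunction::Min => "min",
        AggregateFunction::Max => "max",
        AggregateFunction::Sum => "sum",
        AggregateFunction::Average => "average",
    }
}

/// Hypothetical (major, minor) version pair, compared major first.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct Version {
    major: u32,
    minor: u32,
}

/// Parses "major.minor" or "major.minor.patch"; `None` if malformed.
fn parse_version(text: &str) -> Option<Version> {
    let mut parts = text.split('.');
    let major = parts.next()?.parse().ok()?;
    let minor = parts.next()?.parse().ok()?;
    Some(Version { major, minor })
}

/// Assumed policy: accept a connector only if its spec version does not
/// exceed the newest version this consumer knows how to handle.
fn is_supported(connector: Version, max_supported: Version) -> bool {
    connector <= max_supported
}
```

Under that assumed policy, a connector reporting a newer spec version is rejected up front, so the consumer never encounters an aggregate function it cannot interpret.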

daniel-chambers · 2024-10-21T01:20:40Z

OK, so I've updated this PR and it should be ready for final review and merge. cc @paf31

- Added the discussed average aggregate function to the spec
- Updated the reference implementation with the new comparison operators and aggregate functions
- Updated the reference documentation with the new aggregate function
- Updated the tutorial documentation
- Updated the changelog

Personally, I still don't really like that we're forcing lower-precision types to be upcast to higher-precision types on aggregation, but I'm happy to start with the stricter rule, see whether it causes issues for connectors or performance, and relax it later if necessary.

paf31 · 2024-10-28T23:07:05Z

ndc-reference/bin/reference/main.rs

-                            details: serde_json::Value::Null,
-                        }),
-                    )
+    if let Some(first_value) = values.iter().next() {


This function is quoted verbatim in the spec, so it's probably going to be a bit hard to follow now. You might want to break up the code a bit in the text.

OK, I've reduced the example in the docs down to just the first part of the function. The rest are basically the same, just slight variants based on the type. The function isn't actually that big; it's just the shitty Rust formatter, which loves adding as many newlines as it possibly can, blowing out every piece of error-handling code into a lot of lines.

paf31

Looks good, thanks!

paf31 added 7 commits September 10, 2024 09:55

New meanings for comparison and aggregate operators

9413459

RFC

be86fd2

fmt

117e5f1

clippy

a661016

Spec and changelog

f53a991

Add result type to SUM

b86d8cb

pedantry

aab9958

paf31 marked this pull request as ready for review September 11, 2024 18:53

paf31 requested a review from codedmart as a code owner September 11, 2024 18:53

paf31 requested review from daniel-chambers and 0x777 September 11, 2024 19:02

paf31 changed the title ~~[PACHA-41] New meanings for comparison and aggregate operators~~ [NSP-12] New meanings for comparison and aggregate operators Sep 16, 2024

daniel-chambers reviewed Sep 17, 2024

hallettj previously approved these changes Sep 17, 2024

0x777 previously approved these changes Sep 17, 2024

daniel-chambers added 2 commits October 18, 2024 15:33

Merge remote-tracking branch 'origin/main' into phil/pacha-41-vs-main

54ab798

Added average agg function, update reference impl, updated docs

69ab675

daniel-chambers dismissed stale reviews from 0x777 and hallettj via 69ab675 October 21, 2024 01:16

paf31 commented Oct 28, 2024

daniel-chambers added 2 commits October 29, 2024 16:59

Merge remote-tracking branch 'origin/main' into phil/pacha-41-vs-main

53b99c6

Adjust aggregates tutorial

a191260

paf31 commented Oct 29, 2024

codedmart approved these changes Oct 29, 2024

daniel-chambers merged commit 21941c9 into main Oct 30, 2024
11 checks passed

daniel-chambers deleted the phil/pacha-41-vs-main branch October 30, 2024 00:07
