Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update Arrow 45.0.0 And Datum Arithmetic, change Decimal Division semantics #6832

Merged
merged 24 commits into from
Aug 8, 2023

Conversation

tustvold
Copy link
Contributor

@tustvold tustvold commented Jul 3, 2023

Which issue does this PR close?

Closes #6828
Closes #6933
Closes #7068

Rationale for this change

Use new kernels added in apache/arrow-rs#4465

Make timestamp arithmetic kernels more precse

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

  • Timestamp arithmetic will now return durations
  • Adding dates to negative intervals no longer works (support was only added in Cleanup type coercion (#3419) #6778 largely by accident)

@github-actions github-actions bot added logical-expr Logical plan and expressions physical-expr Physical Expressions core Core DataFusion crate labels Jul 3, 2023
@tustvold tustvold force-pushed the datum-arithmetic branch from cba0451 to 5d54984 Compare July 4, 2023 08:33
@tustvold tustvold force-pushed the datum-arithmetic branch from 5d54984 to 68a89b3 Compare July 4, 2023 14:39
@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Jul 4, 2023
@@ -430,15 +430,11 @@ select '1 month'::interval + '1980-01-01T12:00:00'::timestamp;
----
1980-02-01T12:00:00

query D
query error DataFusion error: Error during planning: Cannot coerce arithmetic expression Interval\(MonthDayNano\) \- Date32 to valid types
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This query was only supported following #6778 and was done so accidentally - #6778 (comment)

I tested some other systems, duckdb explicitly doesn't support this, and I couldn't work out how to get mysql to express this, so I think this is fine

"+-----+---------+",
"| val | ts_diff |",
"+-----+---------+",
"| 3 | PT30S |",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Timestamp arithmetic now returns the appropriate duration type

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I remember the PT30S format is some sort of standard but I think it is a pretty terrible UX and very uncommon.

I know there is a technical difference between duration and interval in the arrow spec but as a user, I find this very confusing:

❯ select (now() - '2023-01-01'::timestamp);
+----------------------------+
| now() - Utf8("2023-01-01") |
+----------------------------+
| P198DT72932.972880S        |
+----------------------------+

On the other hand the original formatting is not consistent with Postgres either.

postgres=# select interval '5 minutes 2  days';
    interval
-----------------
 2 days 00:05:00
(1 row)

postgres=# select interval '5 minutes';
 interval
----------
 00:05:00
(1 row)

I played around with this locally and we can still get an interval with an explicit cast

select (now() - '2023-01-01'::timestamp);
+----------------------------+
| now() - Utf8("2023-01-01") |
+----------------------------+
| P198DT72932.972880S        |
+----------------------------+
1 row in set. Query took 0.002 seconds.
❯ select (now() - '2023-01-01'::timestamp)::interval;
+------------------------------------------------------------+
| now() - Utf8("2023-01-01")                                 |
+------------------------------------------------------------+
| 0 years 0 mons 0 days 4772 hours 15 mins 35.571791000 secs |
+------------------------------------------------------------+
1 row in set. Query took 0.002 seconds.

So we can perhaps add the appropriate coercion rules if necessary

Or perhaps we can change the formatting of Durations in arrow-rs to be more interval like

@waitingkuo do you have any thoughts

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would be happy to change the formatting upstream if there is consensus on what it should be. We are currently just doing what chrono does, which appears to be ISO 8601

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update here is that @tustvold and I spoke, and the next action is I will file a ticket explaining the behavior change and trying to build some consensus on what to do here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think the coercion rules might not work well.

select ('2023-01-02'::timestamp - '2023-01-01'::timestamp)::interval;
+--------------------------------------------------------+
| Utf8("2023-01-02") - Utf8("2023-01-01")                |
+--------------------------------------------------------+
| 0 years 0 mons 0 days 24 hours 0 mins 0.000000000 secs |
+--------------------------------------------------------+
1 row in set. Query took 0.002 seconds.

it's converted to (0 month, 0 day, 86400000000000 nanos) instead of (0 month, 1 day, 0 nanos)

which isn't consist with

select ('2023-01-02'::timestamp - '2023-01-01'::timestamp);
+-----------------------------------------+
| Utf8("2023-01-02") - Utf8("2023-01-01") |
+-----------------------------------------+
| P1D                                     |
+-----------------------------------------+
1 row in set. Query took 0.002 seconds.

both formats are fine so we. in case we decide to align with interval format, we should discuss whether it outputs
0 years 0 mons 1 days 0 hours 0 mins 0 secs or
0 years 0 mons 0 days 24 hours 0 mins 0 secs in this case

Copy link
Contributor Author

@tustvold tustvold Jul 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should probably change duration output to be a number of the duration unit, i.e. 5000 seconds, or 2500 milliseconds, etc... to avoid potential ambiguity over what the duration represents. It might be confusing if both intervals and durations report some number of days, but arithmetic involving them has different semantics across DST boundaries. I will file an upstream ticket and work on it

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have filed two tickets of interest:

@@ -5912,285 +5859,6 @@ mod tests {
check_array(array);
}

fn get_timestamp_test_data(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test is removed as timestamp arithmetic no longer returns intervals (which avoids this whole mess of edge cases)

3 0 years 0 mons 15952 days 23 hours 22 mins 12.667123455 secs
2 0 years 0 mons 8406 days 1 hours 1 mins 54.877123455 secs
1 0 years 0 mons 53 days 16 hours 0 mins 20.000000024 secs
3 P15952DT84132.667123455S
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does highlight another fun quirk of intervals - they sort strangely

@tustvold tustvold changed the title POC: Datum arithmetic Datum arithmetic Jul 4, 2023
@tustvold tustvold force-pushed the datum-arithmetic branch from 07e83f7 to 5180300 Compare July 4, 2023 16:04
@alamb
Copy link
Contributor

alamb commented Jul 8, 2023

🤯
Screenshot 2023-07-08 at 1 22 32 PM

@@ -236,9 +179,6 @@ mod tests {
test_array_negative_op!(Int64, 23456i64, 12345i64);
test_array_negative_op!(Float32, 2345.0f32, 1234.0f32);
test_array_negative_op!(Float64, 23456.0f64, 12345.0f64);
test_array_negative_op_intervals!(YearMonth, 2345i32, 1234i32);
test_array_negative_op_intervals!(DayTime, 23456i64, 12345i64);
test_array_negative_op_intervals!(MonthDayNano, 234567i128, 123456i128);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is moved into sqllogic, unfortunately I couldn't find a way to get a MonthDayNano type, but this is better than nothing

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can do something like this

select arrow_cast(interval '30 minutes', 'Interval(MonthDayNano)');

@tustvold tustvold changed the title Datum arithmetic Update Arrow 44.0.0 And Datum Arithmetic Jul 18, 2023
@tustvold tustvold marked this pull request as ready for review July 18, 2023 18:59
@tustvold tustvold added the api change Changes the API exposed to users of the crate label Jul 18, 2023
"+-----+---------------------------------------------------+",
"| 3 | 0 years 0 mons 0 days 10 hours 0 mins 30.000 secs |",
"| 1 | 0 years 0 mons 0 days 10 hours 0 mins 20.000 secs |",
"| 2 | 0 years 0 mons 0 days 10 hours 0 mins 10.000 secs |",
Copy link
Contributor Author

@tustvold tustvold Aug 2, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was actually wrong before, the logic in make_timestamp_tz_sub_table initializes the arrays as follows

let timestamps1 = vec![
        1_678_892_420_000_000_000i64 / divisor, //2023-03-15T15:00:20.000_000_000
        1_678_892_410_000_000_000i64 / divisor, //2023-03-15T15:00:10.000_000_000
        1_678_892_430_000_000_000i64 / divisor, //2023-03-15T15:00:30.000_000_000
    ];
    let timestamps2 = vec![
        1_678_892_400_000_000_000i64 / divisor, //2023-03-15T15:00:00.000_000_000
        1_678_892_400_000_000_000i64 / divisor, //2023-03-15T15:00:00.000_000_000
        1_678_892_400_000_000_000i64 / divisor, //2023-03-15T15:00:00.000_000_000
    ];

    let array1 =
        PrimitiveArray::<A>::from_iter_values(timestamps1).with_timezone_opt(tz1);
    let array2 =
        PrimitiveArray::<A>::from_iter_values(timestamps2).with_timezone_opt(tz2);

Values stored in timestamp arrays are always with respect to the UTC epoch, therefore when subtracting timestamps the timezone should not matter

@tustvold tustvold marked this pull request as ready for review August 2, 2023 17:13
@alamb
Copy link
Contributor

alamb commented Aug 2, 2023

@viirya / @waitingkuo / @liukun4515 I wonder if you could verify this PR is acceptable (specifically the Decimal display differences as well as the change in decimal division semantics)

@Dandandan
Copy link
Contributor

Does this also fix #6794 (see #5646 (comment))

@tustvold
Copy link
Contributor Author

tustvold commented Aug 3, 2023

It should do, but I haven't verified this

@alamb
Copy link
Contributor

alamb commented Aug 3, 2023

I have posted an announcement about this PR to the mailing list: https://lists.apache.org/thread/837ghhjh9gd1kvg0qqnmxqs0rm62x0xt

I will plan to merge it in Monday Aug 7 2023 unless I hear otherwise

1 similar comment
@alamb
Copy link
Contributor

alamb commented Aug 3, 2023

I have posted an announcement about this PR to the mailing list: https://lists.apache.org/thread/837ghhjh9gd1kvg0qqnmxqs0rm62x0xt

I will plan to merge it in Monday Aug 7 2023 unless I hear otherwise

@tustvold
Copy link
Contributor Author

tustvold commented Aug 3, 2023

apache/arrow-rs#4640 proposes a further tweak to the decimal arithmetic rules. In light of #6828 where decimal division currently silently returns nonsensical answers, I'm not sure this should be a blocker to getting this in, but defer to the consensus. Worst case we could do a patch release for arrow-rs, but I would rather avoid this if possible

@alamb
Copy link
Contributor

alamb commented Aug 8, 2023

I am going to update this PR and resolve conflicts

@alamb
Copy link
Contributor

alamb commented Aug 8, 2023

I believe we have resolved all concerns on this PR and it represents a significant step forward in terms of correctness I plan to merge it in once CI has passed.

@alamb
Copy link
Contributor

alamb commented Aug 8, 2023

I am running this update against the IOx tests as one final check: https://github.com/influxdata/influxdb_iox/pull/8452

@alamb
Copy link
Contributor

alamb commented Aug 8, 2023

Our tests passed. 🚀 Onward

@alamb alamb merged commit 27c7ae8 into apache:main Aug 8, 2023
@alamb
Copy link
Contributor

alamb commented Aug 8, 2023

One final shoutout for this cleanup in this PR:

Screenshot 2023-08-08 at 12 56 17 PM

@viirya
Copy link
Member

viirya commented Aug 8, 2023

@viirya / @waitingkuo / @liukun4515 I wonder if you could verify this PR is acceptable (specifically the Decimal display differences as well as the change in decimal division semantics)

I only concern the change of decimal division semantics could be an impact to some customers potentially if high precision decimal results are important to them. Ideally I'd like to get same precision results as Spark but under current design I think it might be tricky to have two different precision rules for decimal operations. It could possibly make DataFusion complicated on decimal operation too. So it may not be a good idea. Among existing systems, I also don't know if any provides alternative option regarding decimal precision rule.

@tustvold
Copy link
Contributor Author

tustvold commented Aug 8, 2023

apache/arrow-rs#4664 tracks further upstream work related to decimal arithmetic. After that we will have some more options in this space, and can potentially consider a different approach to decimal division inline with Spark. I am currently working on this, but until then the use of the Spark rule is just a recipe for overflow errors, as we have seen

@liukun4515
Copy link
Contributor

@viirya / @waitingkuo / @liukun4515 I wonder if you could verify this PR is acceptable (specifically the Decimal display differences as well as the change in decimal division semantics)

I only concern the change of decimal division semantics could be an impact to some customers potentially if high precision decimal results are important to them. Ideally I'd like to get same precision results as Spark but under current design I think it might be tricky to have two different precision rules for decimal operations. It could possibly make DataFusion complicated on decimal operation too. So it may not be a good idea. Among existing systems, I also don't know if any provides alternative option regarding decimal precision rule.

@tustvold @alamb @viirya
spark has the config to control the precision loss : https://github.com/apache/spark/blob/2be20e54a2222f6cdf64e8486d1910133b43665f/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala#L246

@tustvold
Copy link
Contributor Author

tustvold commented Aug 16, 2023

apache/arrow-rs#4664 (comment) summarises my thoughts on the issue having spent significant time on this. TLDR is we already support a higher precision than most commercial systems, and given this and the fact I personally have no use-cases for decimals, I find it hard to justify spending more of my time on this.

For people looking to emulate spark which only supports precision up to 38, casting to Decimal256 and then truncating down to Decimal128 will be equivalent, and is what a precision loss arithmetic kernel would do

@alamb
Copy link
Contributor

alamb commented Aug 16, 2023

I filed #7301 to track possibly adding a spark decimal compatibility mode if anyone needs it

@viirya
Copy link
Member

viirya commented Aug 16, 2023

@tustvold @alamb @viirya spark has the config to control the precision loss : https://github.com/apache/spark/blob/2be20e54a2222f6cdf64e8486d1910133b43665f/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala#L246

I know it. But so what? You mean that we ask Spark users to disable it so Spark can be consistent with DataFusion??

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api change Changes the API exposed to users of the crate core Core DataFusion crate logical-expr Logical plan and expressions physical-expr Physical Expressions sqllogictest SQL Logic Tests (.slt)
Projects
None yet
7 participants