Timestamp subtraction and interval operations for ScalarValue
#5603
Conversation
/// This function creates the [`NaiveDateTime`] object corresponding to the
/// given timestamp using the units (tick size) implied by argument `mode`.
#[inline]
fn with_timezone_to_naive_datetime(
This should probably make use of https://docs.rs/arrow-array/latest/arrow_array/timezone/struct.Tz.html to parse to a DateTime<Tz>
Crucially this handles things like daylight savings time, where the timezone offset depends on the time in question
There is an example of this here - apache/arrow-rs#3795
Thank you for the review. Do you mean we can use timestamp_ns_to_datetime() and other similar functions to parse from i64 to DateTime, instead of from_timestamp_opt(), which parses i64 to NaiveDateTime?
I cannot see how this code could have a problem with things like daylight saving time. Isn't it sufficient to know two things to find the time difference correctly in any circumstance: 1) the numeric values of the timestamps, and 2) the corresponding UTC offsets of those timestamps?
> corresponding UTC offsets of these timestamps

The timezones may not be of the form "+02:00"; they might also be "America/Los_Angeles", in which case the offset to UTC depends on the date in question. See https://github.com/apache/arrow-rs/pull/3801/files#diff-5f92a7816bbae9b685c2f85ab84a268b85246bfaa14272c5afd339810ad471f3R22
My suggestion is to use the chrono DateTime abstraction to handle applying the offset correctly, along with Tz to handle parsing the timezone string correctly.
In particular as_datetime and as_datetime_with_timezone handle the cases of no timezone and a timezone respectively. You could perhaps crib from them, or even use them directly
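A minimal sketch of this approach (assuming arrow-array's Tz FromStr impl and chrono's TimeZone API; not code from this PR):

```rust
// Sketch only: parse the timezone string into arrow-array's Tz and let chrono
// apply the correct offset (including DST) for the instant in question.
use std::str::FromStr;

use arrow_array::timezone::Tz;
use chrono::{DateTime, TimeZone, Utc};

fn nanos_to_datetime_in_tz(ts_nanos: i64, tz_str: &str) -> Option<DateTime<Tz>> {
    // Tz accepts fixed offsets like "+02:00" and, with the chrono-tz feature,
    // named zones such as "America/Los_Angeles".
    let tz = Tz::from_str(tz_str).ok()?;
    // Interpret the raw value as nanoseconds since the UNIX epoch (UTC), then
    // convert into the target timezone so the offset matches the actual date.
    Some(Utc.timestamp_nanos(ts_nanos).with_timezone(&tz))
}
```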
In the example you mentioned, there is a function string_to_datetime that takes the timestamp and timezone together as a single string. In our case, we have the timestamp as an integer and the timezone as a separate string. To parse the timezone string into Tz, I need the FromStr::from_str implementation for Tz. However, there are two different implementations of that function inside timezone.rs:
The first one is gated behind the chrono-tz feature. Even if I add that feature, the Tz struct and its functions sit under a private namespace and are not public.
The second one, on the other hand, cannot handle timezones such as "America/Los_Angeles" when I use it without modifying arrow and cargo.toml.
I suggest merging this PR as is; when a new release that lets us parse such timezone strings is announced, I will open a follow-up PR supporting that kind of timezone. I could provide this support now by duplicating the relevant parts of the arrow code, but I don't think that would look good.
There is always a FromStr implementation provided; it just becomes "more complete" with the chrono-tz feature enabled. You should be able to use it regardless, give it a go 😄

> when the new release that enables us to parse timezone string is announced
This was added months ago, and is in use already within DataFusion
To be clear, I don't think this blocks the PR if one of the DF reviewers is happy with it. It was just an idle suggestion, as it would save reimplementing parsing and timezone handling logic that already exists upstream and ensures timezones are handled consistently 😄
Sounds good. We will try to make it work with the FromStr impl, and if we cannot for some reason, we will revisit it in a follow-on PR. Thanks for taking a look!
> To be clear, I don't think this blocks the PR if one of the DF reviewers is happy with it. It was just an idle suggestion, as it would save reimplementing parsing and timezone handling logic that already exists upstream and ensures timezones are handled consistently 😄
Thanks for the suggestion. I added the necessary dependencies and features (arrow-array and chrono-tz), and now we can handle the timezones in the form of the example you gave.
Thank you @berkaysynnada and @ozankabak
I think this PR could be merged once it has tests to verify this behavior is consistent with the arrow kernels. I also think it could be made less verbose using ScalarValue::new_interval_ym, etc., as I mention below (but I don't think that is a blocker).
Opinion
I think any operation working on ScalarValue is likely to be much slower than a similar operation on an Array (because of the per-row value overhead) -- i.e., to go fast, DataFusion needs to be vectorized, and that typically means using the arrow kernels. I see how having the same functionality in ScalarValue is not unreasonable, but I worry it will not push people toward the array implementation. However, this is not a blocker in my opinion.
Adding in more ScalarValue support I think is ok, as long as it doesn't become a maintenance burden.
Concern
My biggest fear is that the two implementations (Array kernels and ScalarValue) will become inconsistent / get out of sync (for example, the timezone handling that @tustvold mentions).
Suggestion for keeping scalar consistent with arrow
Would it be possible to rework these tests so that they also verify that the behavior is consistent with the arrow kernels?
For example, in addition to doing lhs.sub(rhs), also do something like (untested):
// Run the scalar value arithmetic
let result = lhs.sub(&rhs).unwrap();
// Cast both scalars to one-element arrays and subtract with the arrow kernel
// (subtract_dyn is assumed here as the dynamically typed arrow subtraction kernel)
let lhs_array = lhs.to_array();
let rhs_array = rhs.to_array();
let array = arrow::compute::subtract_dyn(&lhs_array, &rhs_array).unwrap();
assert_eq!(1, array.len());
// Verify the array has the same value as the scalar arithmetic
assert!(result.eq_array(&array, 0));
p.s. I read through #5411 and #5412 but it wasn't clear to me what the use case of this ScalarValue arithmetic was. Is it related to the Window operations?
datafusion/common/src/scalar.rs
fn test_scalar_interval_add() {
    let cases = [
        (
            ScalarValue::IntervalYearMonth(Some(IntervalYearMonthType::make_value(
I think you can use https://docs.rs/datafusion/20.0.0/datafusion/scalar/enum.ScalarValue.html#method.new_interval_ym to reduce this boilerplate substantially
- ScalarValue::IntervalYearMonth(Some(IntervalYearMonthType::make_value(
+ ScalarValue::new_interval_ym(1, 12)
The same for most of the other tests
Done
@alamb, thanks for reviewing. As you correctly guessed, this is for things like determining window boundaries, interval calculations for pruning, etc. It will not be used on raw data when executing queries (we will use arrow kernels there). @berkaysynnada is working on the arrow kernel parts and we will follow on with another PR when it is ready. It will include both end to end tests (i.e. running queries) and the consistency tests you suggested. I cleaned up the overly verbose interval construction boilerplate, so this is good to go in terms of the feedback so far. We will follow up with the other PR early next week after this merges.
(
    ScalarValue::new_interval_dt(65, 321),
    ScalarValue::new_interval_mdn(2, 5, 1_000_000),
    ScalarValue::new_interval_mdn(-2, 60, 320_000_000),
❤️
Thank you everyone!
@@ -41,10 +41,14 @@ pyarrow = ["pyo3", "arrow/pyarrow"]
[dependencies]
apache-avro = { version = "0.14", default-features = false, features = ["snappy"], optional = true }
arrow = { workspace = true, default-features = false }
arrow-array = { version = "35.0.0", default-features = false, features = ["chrono-tz"] }
This depends on arrow-array 35 even though the crate as a whole still depends on arrow 34, which is likely to cause confusion.
Fix in #5724
Which issue does this PR close?
Closes #5411 and #5412.
Rationale for this change
This PR is composed of two parts but is submitted as a single PR since the parts are closely related (they share constants and change similar places in the file). 340 lines are added for the implementation, and 670 for tests.
The first part adds support for timestamp subtraction within ScalarValue. In PostgreSQL, subtracting two timestamps yields an interval in which days are the largest time unit, and we produce results in the same way. However, since we do not have such detailed fields for an interval, TimestampSecond or TimestampMillisecond subtractions give their result in the IntervalDayTime variant, while TimestampMicrosecond or TimestampNanosecond subtractions give their result in the IntervalMonthDayNano variant without using the month field. I need to underline that we can apply this operation only on scalar values; supporting columnar operations will come in a follow-up PR.
The second part adds support for comparing, adding, and subtracting the ScalarValue interval types IntervalYearMonth, IntervalDayTime, and IntervalMonthDayNano. With this PR, we can apply these operations between both the same and different variants. Columnar value support will come in the follow-up PR as well.
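As a rough illustration of the scalar-level behaviour described above, here is a minimal sketch (not taken from this PR's tests; the exact sub signature and the printed variant contents are assumptions):

```rust
// Minimal sketch (assumed API surface, not copied from the PR): subtracting two
// TimestampNanosecond scalars is expected to produce an IntervalMonthDayNano
// scalar whose month field is unused.
use datafusion_common::{Result, ScalarValue};

fn timestamp_subtraction_sketch() -> Result<()> {
    let lhs = ScalarValue::TimestampNanosecond(Some(2_000_000_000), None);
    let rhs = ScalarValue::TimestampNanosecond(Some(1_500_000_000), None);
    // The 500_000_000 ns difference should come back as an interval scalar.
    let diff = lhs.sub(&rhs)?;
    println!("{diff:?}");
    Ok(())
}
```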
What changes are included in this PR?
- impl_op match arms are extended to cover timestamp and interval types. It should be noted that subtraction requires both timestamps to be of the same type.
- impl PartialOrd for ScalarValue and impl PartialEq for ScalarValue are extended to handle interval types. However, to be able to compare months with days, we need to assume that a month equals 30 days (PostgreSQL behaves in the same way); see the sketch below.
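For illustration, a sketch of that cross-variant comparison under the 30-days-per-month assumption (hypothetical values, not from this PR's tests):

```rust
// Sketch of the comparison described above, assuming months are normalized to
// 30 days when comparing across interval variants.
use datafusion_common::ScalarValue;

fn interval_comparison_sketch() {
    let one_month = ScalarValue::new_interval_ym(0, 1); // 0 years, 1 month
    let forty_days = ScalarValue::new_interval_dt(40, 0); // 40 days, 0 ms
    // With a month treated as 30 days, 40 days compares greater than 1 month.
    assert!(forty_days > one_month);
}
```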
Are these changes tested?
Yes. There are tests covering edge cases with timezone options, and randomized tests that first add/subtract an interval to/from a timestamp and then take the difference between the resulting timestamp and the initial timestamp to assert equality with the given interval. For the interval operations, tests cover all combinations of operations across the different types.
Are there any user-facing changes?
To run end-to-end queries that include timestamp subtraction, a few follow-up changes are still needed:
- The try_new_impl function in arrow-rs gives an error while validating the schema and columns; its expectation for the output column needs to be changed to an interval type.
- coerce_types returns the output type of applying op to arguments of lhs_type and rhs_type. That code, along with the extensions in planner.rs and binary.rs, also needs to be reworked (a rough illustration of the needed output-type rule follows below).
- The evaluate function in datetime.rs has to handle columnar values on each side.
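For context, a hypothetical sketch of the output-type rule described above (the function name and placement are illustrative, not the actual coerce_types code):

```rust
// Hypothetical illustration: subtracting two timestamps of the same unit should
// yield the interval types this PR's ScalarValue implementation produces.
use arrow::datatypes::{DataType, IntervalUnit, TimeUnit};

fn timestamp_subtraction_output_type(lhs: &DataType, rhs: &DataType) -> Option<DataType> {
    match (lhs, rhs) {
        (DataType::Timestamp(TimeUnit::Second, _), DataType::Timestamp(TimeUnit::Second, _))
        | (DataType::Timestamp(TimeUnit::Millisecond, _), DataType::Timestamp(TimeUnit::Millisecond, _)) => {
            Some(DataType::Interval(IntervalUnit::DayTime))
        }
        (DataType::Timestamp(TimeUnit::Microsecond, _), DataType::Timestamp(TimeUnit::Microsecond, _))
        | (DataType::Timestamp(TimeUnit::Nanosecond, _), DataType::Timestamp(TimeUnit::Nanosecond, _)) => {
            Some(DataType::Interval(IntervalUnit::MonthDayNano))
        }
        // Mismatched units (or non-timestamp inputs) are not supported here.
        _ => None,
    }
}
```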