
Timestamp subtraction and interval operations for ScalarValue #5603

Merged
merged 26 commits on Mar 18, 2023

Conversation

@berkaysynnada (Contributor) commented Mar 14, 2023

Which issue does this PR close?

Closes #5411 and #5412.

Rationale for this change

This PR consists of two parts, kept together because they are closely related (they share constants and change similar places in the same file). About 340 lines are implementation and 670 are tests.

The first feature is timestamp subtraction within ScalarValue. In Postgres, the behavior is as follows:

CREATE TABLE test (
  id INT,
  ts1 TIMESTAMP,
  ts2 TIMESTAMP
);
INSERT INTO test VALUES (1, '2016-03-13 00:00:00', '2017-04-13 12:11:03');
INSERT INTO test VALUES (2, '2016-03-04 06:00:00', '2016-03-01 20:33:48');
INSERT INTO test VALUES (3, '2017-03-04 23:59:59', '2016-03-01 12:00:00');
SELECT ts2 - ts1 AS diff_ts FROM test;


We produce the same results as Postgres, which displays the difference with days as the largest time unit. However, since our interval types do not have such fine-grained fields, subtracting TimestampSecond or TimestampMillisecond values yields an IntervalDayTime, while subtracting TimestampMicrosecond or TimestampNanosecond values yields an IntervalMonthDayNano (with the month field unused). Note that this operation currently applies only to scalar values; columnar support will come in a follow-up PR.
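As a rough illustration of the decomposition described above, the difference between two nanosecond-precision timestamps can be split into day and nanosecond fields, leaving the month field at zero. This is a minimal sketch with a hypothetical helper name, not the PR's actual code:

```rust
// Nanoseconds in one day
const NANOS_PER_DAY: i64 = 86_400 * 1_000_000_000;

// Hypothetical helper: split a nanosecond-epoch difference into the
// (days, nanos) fields of an IntervalMonthDayNano-style result,
// leaving the month field unused.
fn ts_diff_as_day_nano(lhs_ns: i64, rhs_ns: i64) -> (i64, i64) {
    let diff = lhs_ns - rhs_ns;
    (diff / NANOS_PER_DAY, diff % NANOS_PER_DAY)
}

fn main() {
    // Two instants 2 days and 500 ns apart
    let (days, nanos) = ts_diff_as_day_nano(2 * NANOS_PER_DAY + 500, 0);
    println!("months: 0, days: {days}, nanos: {nanos}");
}
```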

The second feature is comparing, adding, and subtracting ScalarValue intervals of the types IntervalYearMonth, IntervalDayTime, and IntervalMonthDayNano. With this PR, these operations work both between the same variant and across different variants. Columnar support will likewise come in a follow-up PR.
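To make the cross-variant case concrete, one way to add an IntervalDayTime to an IntervalMonthDayNano is to widen the day-time value into the (months, days, nanos) representation. The struct and helper below are hypothetical illustrations, not the PR's actual code:

```rust
// Illustrative sketch of cross-variant interval addition: widening an
// IntervalDayTime (days, milliseconds) into an IntervalMonthDayNano-style
// (months, days, nanos) value.
#[derive(Debug, PartialEq)]
struct MonthDayNano {
    months: i32,
    days: i32,
    nanos: i64,
}

fn add_dt_to_mdn(days: i32, millis: i32, rhs: MonthDayNano) -> MonthDayNano {
    MonthDayNano {
        months: rhs.months,
        days: rhs.days + days,
        nanos: rhs.nanos + millis as i64 * 1_000_000, // 1 ms = 1_000_000 ns
    }
}

fn main() {
    let sum = add_dt_to_mdn(2, 500, MonthDayNano { months: 1, days: 3, nanos: 10 });
    assert_eq!(sum, MonthDayNano { months: 1, days: 5, nanos: 500_000_010 });
}
```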

What changes are included in this PR?

impl_op match arms are extended to cover timestamp and interval types. Note that subtraction requires both timestamps to have the same type. impl PartialOrd for ScalarValue and impl PartialEq for ScalarValue are extended to handle interval types. However, to compare months against days, we must assume a month equals 30 days (Postgres behaves the same way).
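The 30-day-month assumption amounts to normalizing every interval to a common unit before comparing. A minimal sketch with hypothetical helper names (not the PR's actual code), using total milliseconds as the common unit:

```rust
// Sketch of the comparison rule: normalize intervals to total milliseconds,
// assuming 1 month = 30 days (as Postgres does).
const MILLIS_PER_DAY: i64 = 86_400_000;
const DAYS_PER_MONTH: i64 = 30;

// IntervalYearMonth stores a month count
fn ym_to_millis(months: i32) -> i64 {
    months as i64 * DAYS_PER_MONTH * MILLIS_PER_DAY
}

// IntervalDayTime stores days and milliseconds
fn dt_to_millis(days: i32, millis: i32) -> i64 {
    days as i64 * MILLIS_PER_DAY + millis as i64
}

fn main() {
    // IntervalYearMonth(1 month) compares equal to IntervalDayTime(30 days)
    assert_eq!(ym_to_millis(1), dt_to_millis(30, 0));
    // ...and strictly greater than 29 days + 23:59:59
    assert!(ym_to_millis(1) > dt_to_millis(29, 86_399_000));
}
```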

Are these changes tested?

Yes. There are tests covering edge cases with timezone options, plus randomized tests that first add/subtract an interval to/from a timestamp, then take the difference between the resulting timestamp and the initial one to assert equality with the given interval. For the interval operations, tests cover all combinations of operations across the different types.

| timestamp_next | timestamp_prev | Result |
| --- | --- | --- |
| 2017-04-16 00:00:00.000_000_100 | 2016-03-13 00:00:00.000_000_025 | months: 0, days: 399, nanos: 75 |
| 2016-03-13 00:00:00.000_000_025 | 2017-04-16 00:00:00.000_000_100 | months: 0, days: -399, nanos: -75 |
| 2016-12-16 01:00:00.100 | 2016-12-15 00:00:00.000 | days: 1, millis: 3600100 |
| 2016-12-15 00:00:00.000 | 2016-12-16 01:00:00.100 | days: -1, millis: -3600100 |

Are there any user-facing changes?

To run end-to-end queries including timestamp subtraction:

1. The try_new_impl function in arrow-rs gives an error while validating the schema and columns:

   column types must match schema types, expected Timestamp(Second, None) but found Interval(DayTime) at column index 0

   The expected output column type needs to be changed to the interval type.

2. In DataFusion, there is a function coerce_types that returns the output type of applying op to arguments of lhs_type and rhs_type. This code, along with the related parts of planner.rs and binary.rs, also needs to be reworked. The evaluate function in datetime.rs has to handle columnar values on each side.
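A simplified sketch of the kind of coercion rule described in item 2. The enums and function below are illustrative stand-ins for DataFusion's actual DataType and coerce_types definitions, reflecting the output types stated earlier in this PR description:

```rust
// Simplified stand-ins for Arrow's DataType/TimeUnit (not the real enums)
#[derive(Debug, PartialEq, Clone, Copy)]
enum TimeUnit { Second, Millisecond, Microsecond, Nanosecond }

#[derive(Debug, PartialEq)]
enum DataType {
    Timestamp(TimeUnit),
    IntervalDayTime,
    IntervalMonthDayNano,
}

// Output type of `lhs - rhs`: same-unit timestamps only; coarse units fit
// in IntervalDayTime, fine units need IntervalMonthDayNano.
fn minus_output_type(lhs: &DataType, rhs: &DataType) -> Option<DataType> {
    use {DataType::*, TimeUnit::*};
    match (lhs, rhs) {
        (Timestamp(Second), Timestamp(Second))
        | (Timestamp(Millisecond), Timestamp(Millisecond)) => Some(IntervalDayTime),
        (Timestamp(Microsecond), Timestamp(Microsecond))
        | (Timestamp(Nanosecond), Timestamp(Nanosecond)) => Some(IntervalMonthDayNano),
        _ => None, // mixed units or non-timestamp operands are rejected
    }
}

fn main() {
    let out = minus_output_type(
        &DataType::Timestamp(TimeUnit::Second),
        &DataType::Timestamp(TimeUnit::Second),
    );
    assert_eq!(out, Some(DataType::IntervalDayTime));
}
```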

/// This function creates the [`NaiveDateTime`] object corresponding to the
/// given timestamp using the units (tick size) implied by argument `mode`.
#[inline]
fn with_timezone_to_naive_datetime(
@tustvold (Contributor) commented Mar 15, 2023:
This should probably make use of https://docs.rs/arrow-array/latest/arrow_array/timezone/struct.Tz.html to parse to a DateTime<Tz>

Crucially this handles things like daylight savings time, where the timezone offset depends on the time in question

There is an example of this here - apache/arrow-rs#3795

@berkaysynnada (Author) replied:

Thank you for the review. Do you mean we can use timestamp_ns_to_datetime() and similar functions to parse an i64 into a DateTime, instead of from_timestamp_opt(), which parses an i64 into a NaiveDateTime?
I cannot see how this code could have a problem with daylight saving time. Isn't it sufficient to know two things to find the time difference correctly in any circumstance: 1) the numeric values of the timestamps, and 2) their corresponding UTC offsets?

@tustvold (Contributor) commented Mar 15, 2023:

> corresponding UTC offsets of these timestamps

The timezones may not be of the form "+02:00" they might also be "America/Los_Angeles" in which case the offset to UTC depends on the date in question. See https://github.com/apache/arrow-rs/pull/3801/files#diff-5f92a7816bbae9b685c2f85ab84a268b85246bfaa14272c5afd339810ad471f3R22

My suggestion is to use the chrono DateTime abstraction to handle applying the offset correctly, along with Tz to handle parsing the timezone string correctly.

In particular as_datetime and as_datetime_with_timezone handle the cases of no timezone and a timezone respectively. You could perhaps crib from them, or even use them directly

@berkaysynnada (Author) commented Mar 16, 2023:

In the example you mentioned, there is a function string_to_datetime, taking the timestamp and timezone as a whole string. In our case, we have a timestamp as an integer and a string for the timezone. If I try to implement the part where we parse the timezone string to Tz, I need to use FromStr::from_str implemented for Tz. However, there are 2 different implementations of that function inside timezone.rs:

The first one is gated behind the chrono-tz feature. Even if I enable that feature, the Tz struct and its functions are in a private namespace and not public.

On the other hand, the second implementation cannot handle timezones such as "America/Los_Angeles" when I use it without modifying arrow and Cargo.toml.

I suggest merging this PR; when a new release that lets us parse timezone strings is announced, I will open a new PR supporting those kinds of timezones.

It would not look good, but I could provide this support by duplicating the relevant parts of the arrow code.

@tustvold (Contributor) commented Mar 16, 2023:

There is always a FromStr implementation provided, it just becomes "more complete" with the chrono-tz feature enabled. You should be able to use it regardless, give it a go 😄

> when the new release that enables us to parse timezone string is announced

This was added months ago, and is in use already within DataFusion

Contributor:

TBC I don't think this blocks this PR, if one of the DF reviewers is happy with it, it was just an idle suggestion as it would save reimplementing parsing and timezone handling logic that already exists upstream, and ensures timezones are handled consistently 😄

Contributor:

Sounds good. We will try to make it work with the FromStr impl and if we can not for some reason, we will revisit in a follow-on PR. Thanks for taking a look!

@berkaysynnada (Author) replied:

> TBC I don't think this blocks this PR, if one of the DF reviewers is happy with it, it was just an idle suggestion as it would save reimplementing parsing and timezone handling logic that already exists upstream, and ensures timezones are handled consistently 😄

Thanks for the suggestion. I added the necessary dependencies and features (arrow-array and chrono-tz), and now we can handle the timezones in the form of the example you gave.

@alamb (Contributor) left a comment:

Thank you @berkaysynnada and @ozankabak

I think this PR could be merged once it has tests verifying this behavior is consistent with the arrow kernels. I also think it could be made less verbose using ScalarValue::new_interval_ym, etc., as I mention below, but I don't think that is a blocker.

Opinion

I think any operation working on ScalarValue is likely to be much slower than the same operation on an Array (because of the per-row overhead) -- i.e., to go fast, DataFusion needs to be vectorized, which typically means using the arrow kernels. I see how having the same functionality in ScalarValue is not unreasonable, but I worry it will not push people toward the array implementation. However, this is not a blocker in my opinion.

Adding in more ScalarValue support I think is ok, as long as it doesn't become a maintenance burden.

Concern

My biggest fear is that the two implementations (Array kernels and ScalarValue) will become inconsistent / get out of sync (for example, the timezone handling that @tustvold mentions).

Suggestion for keeping scalar consistent with arrow

Would it be possible to rework these tests so that they also verify that the behavior is consistent with the arrow kernels?

For example, in addition to doing lhs.sub(rhs), also do something like (untested):

// Run the scalar value operation
let result = lhs.sub(rhs).unwrap();
// convert both to one-element arrays and run the arrow kernel
let array = arrow::compute::subtract_dyn(&lhs.to_array(), &rhs.to_array()).unwrap();
assert_eq!(1, array.len());
// verify the array has the same value as the scalar arithmetic
assert!(result.eq_array(&array, 0));

P.S. I read through #5411 and #5412, but it wasn't clear to me what the use case of this ScalarValue arithmetic is. Is it related to the window operations?

fn test_scalar_interval_add() {
let cases = [
(
ScalarValue::IntervalYearMonth(Some(IntervalYearMonthType::make_value(
@alamb commented Mar 17, 2023:

I think you can use https://docs.rs/datafusion/20.0.0/datafusion/scalar/enum.ScalarValue.html#method.new_interval_ym to reduce this boilerplate substantially

Suggested change
ScalarValue::IntervalYearMonth(Some(IntervalYearMonthType::make_value(
ScalarValue::new_interval_ym(1, 12)

The same for most of the other tests

@berkaysynnada (Author) replied:
Done

@ozankabak (Contributor) commented:

@alamb, thanks for reviewing. As you correctly guessed, this is for things like determining window boundaries, interval calculations for pruning etc. It will not be used on raw data when executing queries (we will use arrow kernels there).

@berkaysynnada is working on the arrow kernel parts and we will follow on with another PR when it is ready. It will include both end to end tests (i.e. running queries) and the consistency tests you suggested.

I cleaned up the overly verbose interval construction boilerplate, so this is good to go in terms of the feedback so far. We will follow up with the other PR early next week after this merges.

(
ScalarValue::new_interval_dt(65, 321),
ScalarValue::new_interval_mdn(2, 5, 1_000_000),
ScalarValue::new_interval_mdn(-2, 60, 320_000_000),
Contributor:

❤️

@alamb alamb merged commit 3ccf1ae into apache:main Mar 18, 2023
@alamb commented Mar 18, 2023:

Thank you everyone!

@berkaysynnada berkaysynnada deleted the feature/time-interval-support branch March 18, 2023 15:58
@andygrove andygrove added the enhancement New feature or request label Mar 24, 2023
@@ -41,10 +41,14 @@ pyarrow = ["pyo3", "arrow/pyarrow"]
[dependencies]
apache-avro = { version = "0.14", default-features = false, features = ["snappy"], optional = true }
arrow = { workspace = true, default-features = false }
arrow-array = { version = "35.0.0", default-features = false, features = ["chrono-tz"] }
Contributor:
This is depending on arrow-array 35 despite the crate as a whole still depending on arrow 34. This is likely to cause confusion

Contributor:

Fix in #5724

Labels: enhancement (New feature or request)
Successfully merging this pull request may close these issues.

Timestamp Subtraction
5 participants