Timestamp subtraction and interval operations for ScalarValue
#5603
Conversation
/// This function creates the [`NaiveDateTime`] object corresponding to the
/// given timestamp using the units (tick size) implied by argument `mode`.
#[inline]
fn with_timezone_to_naive_datetime(
This should probably make use of https://docs.rs/arrow-array/latest/arrow_array/timezone/struct.Tz.html to parse to a DateTime<Tz>
Crucially this handles things like daylight savings time, where the timezone offset depends on the time in question
There is an example of this here - apache/arrow-rs#3795
Thank you for the review. Do you mean we can use timestamp_ns_to_datetime() and other similar functions to parse from i64 to DateTime, instead of from_timestamp_opt(), which parses i64 to NaiveDateTime?
I cannot see how this code could have a problem with things like daylight saving time. Isn't it sufficient to know two things to find the time difference correctly in any circumstance: 1) the numeric values of the timestamps, and 2) the corresponding UTC offsets of those timestamps?
> corresponding UTC offsets of these timestamps

The timezones may not be of the form "+02:00"; they might also be "America/Los_Angeles", in which case the offset to UTC depends on the date in question. See https://github.com/apache/arrow-rs/pull/3801/files#diff-5f92a7816bbae9b685c2f85ab84a268b85246bfaa14272c5afd339810ad471f3R22
My suggestion is to use the chrono DateTime abstraction to handle applying the offset correctly, along with Tz to handle parsing the timezone string correctly.
In particular as_datetime and as_datetime_with_timezone handle the cases of no timezone and a timezone respectively. You could perhaps crib from them, or even use them directly
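A minimal sketch of this approach (assuming arrow-array's Tz FromStr impl and chrono's TimeZone API; not code from this PR):

```rust
// Sketch only: parse the timezone string into arrow-array's Tz and let chrono
// apply the correct offset (including DST) for the instant in question.
use std::str::FromStr;

use arrow_array::timezone::Tz;
use chrono::{DateTime, TimeZone, Utc};

fn nanos_to_datetime_in_tz(ts_nanos: i64, tz_str: &str) -> Option<DateTime<Tz>> {
    // Tz accepts fixed offsets like "+02:00" and, with the chrono-tz feature,
    // named zones such as "America/Los_Angeles".
    let tz = Tz::from_str(tz_str).ok()?;
    // Interpret the raw value as nanoseconds since the UNIX epoch (UTC), then
    // convert into the target timezone so the offset matches the actual date.
    Some(Utc.timestamp_nanos(ts_nanos).with_timezone(&tz))
}
```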
In the example you mentioned, there is a function string_to_datetime that takes the timestamp and timezone together as a single string. In our case, we have the timestamp as an integer and the timezone as a separate string. To parse the timezone string into Tz, I need the FromStr::from_str implementation for Tz. However, there are two different implementations of that function inside timezone.rs:
The first one is gated behind the chrono-tz feature. Even if I add that feature, the Tz struct and its functions sit under a private namespace and are not public.
The second one, on the other hand, cannot handle timezones such as "America/Los_Angeles" when I use it without modifying arrow and cargo.toml.
I suggest merging this PR as is; when a new release that lets us parse such timezone strings is announced, I will open a follow-up PR supporting that kind of timezone. I could provide this support now by duplicating the relevant parts of the arrow code, but I don't think that would look good.
There is always a FromStr implementation provided; it just becomes "more complete" with the chrono-tz feature enabled. You should be able to use it regardless, give it a go 😄

> when the new release that enables us to parse timezone string is announced
This was added months ago, and is in use already within DataFusion
To be clear, I don't think this blocks the PR if one of the DF reviewers is happy with it. It was just an idle suggestion, as it would save reimplementing parsing and timezone handling logic that already exists upstream and ensures timezones are handled consistently 😄
Sounds good. We will try to make it work with the FromStr impl, and if we cannot for some reason, we will revisit it in a follow-on PR. Thanks for taking a look!
> To be clear, I don't think this blocks the PR if one of the DF reviewers is happy with it. It was just an idle suggestion, as it would save reimplementing parsing and timezone handling logic that already exists upstream and ensures timezones are handled consistently 😄
Thanks for the suggestion. I added the necessary dependencies and features (arrow-array and chrono-tz), and now we can handle the timezones in the form of the example you gave.
Thank you @berkaysynnada and @ozankabak
I think this PR could be merged once it has tests to verify this behavior is consistent with the arrow kernels. I also think it could be made less verbose using ScalarValue::new_interval_ym, etc., as I mention below (but I don't think that is a blocker).
Opinion
I think any operation working on ScalarValue is likely to be much slower than a similar operation on an Array (because of the per-row value overhead) -- i.e., to go fast, DataFusion needs to be vectorized, and that typically means using the arrow kernels. I see how having the same functionality in ScalarValue is not unreasonable, but I worry it will not push people toward the array implementation. However, this is not a blocker in my opinion.
Adding in more ScalarValue support I think is ok, as long as it doesn't become a maintenance burden.
Concern
My biggest fear is that the two implementations (Array kernels and ScalarValue) will become inconsistent / get out of sync (for example, the timezone handling that @tustvold mentions).
Suggestion for keeping scalar consistent with arrow
Would it be possible to rework these tests so that they also verify that the behavior is consistent with the arrow kernels?
For example, in addition to doing lhs.sub(rhs), also do something like (untested):
// Run the scalar value arithmetic
let result = lhs.sub(&rhs).unwrap();
// Cast both scalars to one-element arrays and subtract with the arrow kernel
// (subtract_dyn is assumed here as the dynamically typed arrow subtraction kernel)
let lhs_array = lhs.to_array();
let rhs_array = rhs.to_array();
let array = arrow::compute::subtract_dyn(&lhs_array, &rhs_array).unwrap();
assert_eq!(1, array.len());
// Verify the array has the same value as the scalar arithmetic
assert!(result.eq_array(&array, 0));
p.s. I read through #5411 and #5412 but it wasn't clear to me what the use case of this ScalarValue arithmetic was. Is it related to the Window operations?
datafusion/common/src/scalar.rs
fn test_scalar_interval_add() {
    let cases = [
        (
            ScalarValue::IntervalYearMonth(Some(IntervalYearMonthType::make_value(
I think you can use https://docs.rs/datafusion/20.0.0/datafusion/scalar/enum.ScalarValue.html#method.new_interval_ym to reduce this boilerplate substantially
- ScalarValue::IntervalYearMonth(Some(IntervalYearMonthType::make_value(
+ ScalarValue::new_interval_ym(1, 12)
The same for most of the other tests
Done
@alamb, thanks for reviewing. As you correctly guessed, this is for things like determining window boundaries, interval calculations for pruning, etc. It will not be used on raw data when executing queries (we will use arrow kernels there). @berkaysynnada is working on the arrow kernel parts and we will follow on with another PR when it is ready. It will include both end to end tests (i.e. running queries) and the consistency tests you suggested. I cleaned up the overly verbose interval construction boilerplate, so this is good to go in terms of the feedback so far. We will follow up with the other PR early next week after this merges.
(
    ScalarValue::new_interval_dt(65, 321),
    ScalarValue::new_interval_mdn(2, 5, 1_000_000),
    ScalarValue::new_interval_mdn(-2, 60, 320_000_000),
❤️
Thank you everyone!
@@ -41,10 +41,14 @@ pyarrow = ["pyo3", "arrow/pyarrow"]
[dependencies]
apache-avro = { version = "0.14", default-features = false, features = ["snappy"], optional = true }
arrow = { workspace = true, default-features = false }
arrow-array = { version = "35.0.0", default-features = false, features = ["chrono-tz"] }
This depends on arrow-array 35 even though the crate as a whole still depends on arrow 34, which is likely to cause confusion.
Fix in #5724
Which issue does this PR close?
Closes #5411 and #5412.
Rationale for this change
This PR is composed of two parts but is submitted as a single PR since the parts are closely related (they share constants and change similar places in the file). 340 lines are added for the implementation, and 670 for tests.
The first part adds support for timestamp subtraction within ScalarValue. In PostgreSQL, subtracting two timestamps yields an interval in which days are the largest time unit, and we produce results in the same way. However, since we do not have such detailed fields for an interval, TimestampSecond or TimestampMillisecond subtractions give their result in the IntervalDayTime variant, while TimestampMicrosecond or TimestampNanosecond subtractions give their result in the IntervalMonthDayNano variant without using the month field. I need to underline that we can apply this operation only on scalar values; supporting columnar operations will come in a follow-up PR.
The second part adds support for comparing, adding, and subtracting the ScalarValue interval types IntervalYearMonth, IntervalDayTime, and IntervalMonthDayNano. With this PR, we can apply these operations between both the same and different variants. Columnar value support will come in the follow-up PR as well.
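As a rough illustration of the scalar-level behaviour described above, here is a minimal sketch (not taken from this PR's tests; the exact sub signature and the printed variant contents are assumptions):

```rust
// Minimal sketch (assumed API surface, not copied from the PR): subtracting two
// TimestampNanosecond scalars is expected to produce an IntervalMonthDayNano
// scalar whose month field is unused.
use datafusion_common::{Result, ScalarValue};

fn timestamp_subtraction_sketch() -> Result<()> {
    let lhs = ScalarValue::TimestampNanosecond(Some(2_000_000_000), None);
    let rhs = ScalarValue::TimestampNanosecond(Some(1_500_000_000), None);
    // The 500_000_000 ns difference should come back as an interval scalar.
    let diff = lhs.sub(&rhs)?;
    println!("{diff:?}");
    Ok(())
}
```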
What changes are included in this PR?
- impl_op match arms are extended to cover timestamp and interval types. It should be noted that subtraction requires both timestamps to be of the same type.
- impl PartialOrd for ScalarValue and impl PartialEq for ScalarValue are extended to handle interval types. However, to be able to compare months with days, we need to assume that a month equals 30 days (PostgreSQL behaves in the same way); see the sketch below.
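For illustration, a sketch of that cross-variant comparison under the 30-days-per-month assumption (hypothetical values, not from this PR's tests):

```rust
// Sketch of the comparison described above, assuming months are normalized to
// 30 days when comparing across interval variants.
use datafusion_common::ScalarValue;

fn interval_comparison_sketch() {
    let one_month = ScalarValue::new_interval_ym(0, 1); // 0 years, 1 month
    let forty_days = ScalarValue::new_interval_dt(40, 0); // 40 days, 0 ms
    // With a month treated as 30 days, 40 days compares greater than 1 month.
    assert!(forty_days > one_month);
}
```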
Are these changes tested?
Yes. There are tests covering edge cases with timezone options, and randomized tests that first add/subtract an interval to/from a timestamp and then take the difference between the resulting timestamp and the initial timestamp to assert equality with the given interval. For the interval operations, tests cover all combinations of operations across the different types.
Are there any user-facing changes?
To run end-to-end queries that include timestamp subtraction, a few follow-up changes are still needed:
- The try_new_impl function in arrow-rs gives an error while validating the schema and columns; its expectation for the output column needs to be changed to an interval type.
- coerce_types returns the output type of applying op to arguments of lhs_type and rhs_type. That code, along with the extensions in planner.rs and binary.rs, also needs to be reworked (a rough illustration of the needed output-type rule follows below).
- The evaluate function in datetime.rs has to handle columnar values on each side.
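For context, a hypothetical sketch of the output-type rule described above (the function name and placement are illustrative, not the actual coerce_types code):

```rust
// Hypothetical illustration: subtracting two timestamps of the same unit should
// yield the interval types this PR's ScalarValue implementation produces.
use arrow::datatypes::{DataType, IntervalUnit, TimeUnit};

fn timestamp_subtraction_output_type(lhs: &DataType, rhs: &DataType) -> Option<DataType> {
    match (lhs, rhs) {
        (DataType::Timestamp(TimeUnit::Second, _), DataType::Timestamp(TimeUnit::Second, _))
        | (DataType::Timestamp(TimeUnit::Millisecond, _), DataType::Timestamp(TimeUnit::Millisecond, _)) => {
            Some(DataType::Interval(IntervalUnit::DayTime))
        }
        (DataType::Timestamp(TimeUnit::Microsecond, _), DataType::Timestamp(TimeUnit::Microsecond, _))
        | (DataType::Timestamp(TimeUnit::Nanosecond, _), DataType::Timestamp(TimeUnit::Nanosecond, _)) => {
            Some(DataType::Interval(IntervalUnit::MonthDayNano))
        }
        // Mismatched units (or non-timestamp inputs) are not supported here.
        _ => None,
    }
}
```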