Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add temporal kernels for arithmetic with timestamps and durations #527

Closed
Tracked by #3148
alamb opened this issue Jul 6, 2021 · 13 comments · Fixed by #4465
Closed
Tracked by #3148

Add temporal kernels for arithmetic with timestamps and durations #527

alamb opened this issue Jul 6, 2021 · 13 comments · Fixed by #4465
Assignees
Labels
enhancement Any new improvement worthy of a entry in the changelog

Comments

@alamb
Copy link
Contributor

alamb commented Jul 6, 2021

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
I am trying to arithmetically create date / time columns. In my case I want to add some duration (7 days) to a Date32 column

Describe the solution you'd like

I would like temporal kernels that allowed operations such as:

TimestampNanosecondArray + Duration --> TimestampNanosecondArray
Date32Array + Duration --> Date32Array
etc.

I would like to be able to do something like

    let date_strings = vec![
        Some("2020-01-01"),
        Some("2020-01-02"),
        Some("2020-01-03"),
        None,
        Some("2020-01-04"),
    ];
    let arr_date32 = arrow::compute::cast(&arr_string, &DataType::Date32).expect("casting to date");

    // note there is no such thing as DurationArray at this time
    let offset_seconds = DurationArray::from_days(vec![1, 2, 3, 4, 5]);
    let arr_date32_offset = compute::kernels::add(arr_date32, offset_seconds)

    // expect the output is 
    // 2021-01-02
    // 2021-01-04
    // 2021-01-06
    // None
    // 2021-01-08

Describe alternatives you've considered
It might be nice to extend the arithmetic kernels to take Arc<dyn Array> rather than the PrimitiveArray as they do now and do the casting / calling specific implementations directly.

Additional context
Add any other context or screenshots about the feature request here.

@alamb alamb added the enhancement Any new improvement worthy of a entry in the changelog label Jul 6, 2021
@alamb
Copy link
Contributor Author

alamb commented Jul 6, 2021

FYI @velvia

@jimexist
Copy link
Member

anyone taking this? also is there any reference implementation in other languages?

@alamb
Copy link
Contributor Author

alamb commented Jul 14, 2021

@jimexist no one that I know of is taking this on. If you were interested that would be awesome

The reference implementation would probably still be postgres in my mind. There is a nice table https://www.postgresql.org/docs/current/functions-datetime.html#OPERATORS-DATETIME-TABLE that describes all the operators, input types, and output types. I think it helps because certain combinations, such as date + date don't really make logical sense so having the reference to guide what is needed would help a lot

@velvia / @andygrove / @jorgecarleitao any thoughts?

@nevi-me
Copy link
Contributor

nevi-me commented Jul 14, 2021

this also relates to #45

@velvia
Copy link
Contributor

velvia commented Jul 15, 2021

This is definitely a good feature.

It seems to me that in most cases, that time/duration manipulations could be based on existing arithmetic kernels, that'd be one way to go about it.
ie some combination of cast(date32 -> i32) -> math add kernel -> cast back for example.

Or at least, leveraging existing math kernels would make a ton of sense.

@alamb
Copy link
Contributor Author

alamb commented Jul 15, 2021

Or at least, leveraging existing math kernels would make a ton of sense.

I agree -- it may be that since the date/time arrays are PrimitiveArrays it might already work. But I remember problems when I tried.

My selfish initial desire is to be able to write a filter like "the last 5 minutes" in DataFusion so something like

SELECT ...
WHERE
   timestamp_column > (now() - interval `5 minutes')

@jorgecarleitao
Copy link
Member

I would say that we need to generalize the compute arithmetic kernels to support &dyn Array and then specialize based on DataType or bail if the operation is not valid for that datatype.

We do this in arrow2 thanks to @elferherrera, here: https://github.com/jorgecarleitao/arrow2/blob/main/src/compute/arithmetics/mod.rs#L126

@alamb
Copy link
Contributor Author

alamb commented Jul 15, 2021

I would say that we need to generalize the compute arithmetic kernels to support &dyn Array and then specialize based on DataType or bail if the operation is not valid for that datatype.

I very much support this notion -- I have filed #558 to track it.

I can take a crack at the proposed interface if this seems reasonable to people

@bjchambers
Copy link
Contributor

I have some kernels doing time arithmetic for a personal project. I think it is helpful to consider more than just durations since, those are the easiest case. Specifically, we should also be able to add intervals to timestamps.

I think it may be difficult to do these using existing arithmetic kernels. For the case of fixed durations (5000 seconds) it may be possible, but when handling intervals we'll need special kernels (eg., add 5 calendar days). One possible interface would be something like add_time<Time, Delta>(time: &PrimitiveArray<Time>, delta: &PrimitiveArray<Delta>) -> PrimitiveArray<Time>, for some traits that define temporal types and deltas.

Then we can use chrono and chronoutil to support adding a duration or relative duration by having implementations of something like an ArrowTimeDelta trait for the data types that represent fixed or relative durations.

@alamb
Copy link
Contributor Author

alamb commented Jul 22, 2021

Yeah, handling intervals will take some finagling / handling intervals that aren't fixed numbers of seconds.

However, I think the type system can represent such durations (using a DataType::Duration: https://github.com/apache/arrow-rs/blob/master/arrow/src/datatypes/datatype.rs#L97) so we can perhaps use PrimitiveArray<Duration> when you refer to Delta above

@bjchambers
Copy link
Contributor

I don't think that entirely works for two reasons -- first, there are actually multiple Duration types (in Rust) for the same DataType. Second, the delta may also be an interval. Specifically, it needs to be constrained in a way that supports the following arrays:

  • PrimitiveArray<DurationSecondType> - add a fixed number of seconds
  • PrimitiveArray<DurationMicrosecondType> - add a fixed number of microseconds
  • PrimitiveArray<DurationMillisecondType> - add a fixed number of milliseconds
  • PrimitiveArray<DurationNanosecondType> - add a fixed number of nanoseconds
  • PrimitiveArray<IntervalDayTimeType> - add a number of calendar days (and seconds) (description says this is a pair of i32 packed into an i64)
  • PrimitiveArray<IntervalYearMonthType> - add a number of calendar months

From an implementation, I think we can do this using chrono and chronoutil. Something like:

pub trait ArrowTimeDelta: ArrowPrimitiveType {
    fn add_to(time: NaiveDateTime, value: Self::Native) -> NaiveDateTime;
}

impl ArrowTimeDelta<RelativeDuration> for IntervalDayTimeType {
    fn get_delta(value: Self::Native) -> RelativeDuration {
        // DayTime is represented as a 64 bit value -- 32 bit day and 32 bit seconds.
        let days = RelativeDuration::days(value >> 32);
        let seconds = RelativeDuration::seconds((value as i32) as i64);
        days + seconds
    }
}

impl ArrowTimeDelta for IntervalYearMonthType {
    fn add_to(time: NaiveDateTime, value: Self::Native) -> NaiveDateTime {
        // YearMonth is represented as a 32 bit value containing the number of months.
        time + RelativeDuration::months(value)
    }
}

impl ArrowTimeDelta for DurationSecondType {
    fn add_to(time: NaiveDateTime, value: Self::Native) -> NaiveDateTime {
        time + Duration::seconds(value)
    }
}

impl ArrowTimeDelta for DurationMillisecondType {
    fn add_to(time: NaiveDateTime, value: Self::Native) -> NaiveDateTime {
        time + Duration::milliseconds(value)
    }
}

impl ArrowTimeDelta for DurationMicrosecondType {
    fn add_to(time: NaiveDateTime, value: Self::Native) -> NaiveDateTime {
        time + Duration::microseconds(value)
    }
}

impl ArrowTimeDelta for DurationNanosecondType {
    fn add_to(time: NaiveDateTime, value: Self::Native) -> NaiveDateTime {
        time + Duration::nanoseconds(value)
    }
}

Then the implementation is basically iterating over a pair of arrays and callling add_to.

@alamb
Copy link
Contributor Author

alamb commented Jul 23, 2021

From an implementation, I think we can do this using chrono and chronoutil. Something like:

That looks like a great idea to me. Nice

@bjchambers
Copy link
Contributor

Added a rough draft. The temporal values are a bit odd -- it looks like there is as_datetime but that doesn't respect the timestamp of the input, and seems like it would be doing a match inside the loop, so I added a method to ArrowTimestampType. Not sure if there is a better way to operate on the date time -- maybe a method that returns a closure that gets inlined away?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Any new improvement worthy of a entry in the changelog
Projects
None yet
7 participants