Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: cast from/to interval and duration #4020

Merged
merged 9 commits into from
Apr 16, 2023

Conversation

Weijun-H
Copy link
Member

@Weijun-H Weijun-H commented Apr 5, 2023

Which issue does this PR close?

Closes #3998

Rationale for this change

Explain #3998

What changes are included in this PR?

Are there any user-facing changes?

@github-actions github-actions bot added the arrow Changes to the arrow crate label Apr 5, 2023
@Weijun-H Weijun-H force-pushed the cast-between-duration-interval branch from 9082dff to 6d58d1c Compare April 5, 2023 10:01
array: &dyn Array,
cast_options: &CastOptions,
) -> Result<ArrayRef, ArrowError> {
let array = array.as_any().downcast_ref::<PrimitiveArray<D>>().unwrap();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we propagate the error?

Comment on lines 524 to 527
for i in 0..array.len() {
if array.is_null(i) {
builder.append_null();
} else {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For array with no nulls, I think we can have a faster path?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PrimitiveArray::from_trusted_len_iter may fit here.

builder.append_null();
} else {
let v = array.value(i) as i128;
let v = v.mul_checked(scale);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For scale 1 case, maybe we can skip this call?

return Err(ArrowError::ComputeError(format!(
"Cannot cast to {:?}. Overflowing on {:?}",
D::DATA_TYPE,
e
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

e is ArrowError, so you will get nested ArrowError.

@metesynnada
Copy link
Contributor

LGTM, thanks for the effort.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks as well for this PR @Weijun-H -- it is getting there. I am not sure the overflow handling is handled correctly. I left some comments and suggested tests that I think will help get this PR ready to go

if cast_options.safe {
let iter = array
.iter()
.map(|v| v.and_then(|v| ((v as i128).checked_mul(scale))));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure this code correctly converts v into an IntervalMonthDayNanoType -- the IntervalMonthDayNanoType is represented as two fields but this is treating it as a single native type)

I think we should construct IntervalMonthDayNano with this method:

https://docs.rs/arrow/37.0.0/arrow/datatypes/struct.IntervalMonthDayNanoType.html#method.make_value

So something like

let months = 0;
let days = 0;
let nanos: i64 = v.checked_mul(scale)?;
Ok(IntervalMonthDayNanoType::make_value(months, days, nanos))

#[test]
fn test_cast_from_duration_to_interval() {
// from duration second to interval month day nano
let array = vec![1234567];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you also please test a value if i64::MAX here and ensure an error is returned?

assert_eq!(casted_array.value(0), 0);

// from interval month day nano to duration millisecond
let array = vec![1234567];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could probably refactor this repetition into a function and avoid so much boiler plate in the tests

&CastOptions { safe: false },
)
.unwrap();
assert_eq!(casted_array.value(0), 9223372036854775807000000);
Copy link
Contributor

@alamb alamb Apr 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So if I convert this value back into fields:

    let (months, days, nanos) = IntervalMonthDayNanoType::to_parts(9223372036854775807000000);
    println!("months: {months}, days: {days}, nanos: {nanos}");

it prints out:

months: 0, days: 499999, nanos: -1000000

Which doesn't seem right for 9223372036854775807 milliseconds

I think what is happening can be explained here (the nanoseconds fields have overflowed into the days)

https://docs.rs/arrow-array/37.0.0/src/arrow_array/types.rs.html#435-445

        /*
        https://github.com/apache/arrow/blob/02c8598d264c839a5b5cf3109bfd406f3b8a6ba5/cpp/src/arrow/type.h#L1475
        struct MonthDayNanos {
            int32_t months;
            int32_t days;
            int64_t nanoseconds;
        }
        128     112     96      80      64      48      32      16      0
        +-------+-------+-------+-------+-------+-------+-------+-------+
        |     months    |      days     |             nanos             |
        +-------+-------+-------+-------+-------+-------+-------+-------+
        */

I think what should happen is

  1. (preferred): the cast should return an error if the nanosecond field would have overflowed
  2. (secondarily): if the nanosecond field overflows the overflow should be converted into days

I think erroring (at least at first) is preferred because it avoids questions like "how many seconds are there in a day" which is not actually a fixed quantity given leap seconds, etc.

If someone has need of casting such durations to intervals, we can then sort out the mechanics

@alamb
Copy link
Contributor

alamb commented Apr 15, 2023

Thanks @Weijun-H -- I think this is getting close. From my perspecitive all that is left is handling nanosecond overflow correctly and this will be good to go

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for bearing with my @Weijun-H -- sorry for the back and forth

if cast_options.safe {
let iter = array
.iter()
.map(|v| v.and_then(|v| v.checked_mul(scale).map(|v| v as i128)));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to check the overflow in the safe case as well -- and and set the result to Null / NONE when overflow happened

);
assert_eq!(casted_array.value(0), 1234567000);

let array = vec![i64::MAX];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you also please add a test here (and the cases below) showing what happens when an overflow happens with DEFAULT_CAST_OPTIONS? I would expect the result to be Null

&DEFAULT_CAST_OPTIONS,
)
.unwrap();
assert_eq!(casted_array.value(0), 0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this check should be that the casted_array.valid(0) is false, right?

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this PR is ready to go. I don't totally understand why https://github.com/apache/arrow-rs/pull/4020/files#r1167579187 is not needed, but the test coverage

        let array = vec![i64::MAX];
        let casted_array = cast_from_duration_to_interval::<DurationMillisecondType>(
            array.clone(),
            &DEFAULT_CAST_OPTIONS,
        )
        .unwrap();
        assert!(!casted_array.is_valid(0));

covers the case I was thinking of so 👍

Thank you @Weijun-H

@Weijun-H
Copy link
Member Author

I think this PR is ready to go. I don't totally understand why #4020 (files) is not needed, but the test coverage

        let array = vec![i64::MAX];
        let casted_array = cast_from_duration_to_interval::<DurationMillisecondType>(
            array.clone(),
            &DEFAULT_CAST_OPTIONS,
        )
        .unwrap();
        assert!(!casted_array.is_valid(0));

covers the case I was thinking of so 👍

Thank you @Weijun-H

Because v is i64 in #4020 (files), checked_mul will return None when overflowing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support casting to/from Interval and Duration
4 participants