Support for casting Utf8 and LargeUtf8 --> Interval #3762

Merged (15 commits) on Mar 7, 2023

Contributor

@doki23 doki23 commented Feb 25, 2023

Which issue does this PR close?

Closes #3643.

@doki23 doki23 changed the title Cast string interval Cast string to interval Feb 25, 2023
@github-actions github-actions bot added the arrow Changes to the arrow crate label Feb 25, 2023
@doki23 doki23 marked this pull request as ready for review February 27, 2023 15:34
@doki23
Contributor Author

doki23 commented Feb 27, 2023

It seems that the parse function is too complex to get auto-vectorized.

Contributor

@alamb alamb left a comment

Thank you @doki23 -- this is looking very close.

I think this PR needs a few more tests, and I suggest removing the assert_contains but otherwise I think it looks 👌

Thank you very much

* SECONDS_PER_HOUR
* NANOS_PER_SECOND;

// Convert to higher units as much as possible
Contributor Author

@doki23 doki23 Mar 4, 2023

@alamb Thank you for your review. I made some changes here: previously 31 days stayed as 31 days, but now it returns 1 month 1 day. I think that's more ergonomic.

Contributor

@tustvold tustvold Mar 6, 2023

I don't think this is correct, the number of days in a month is not fixed. The postgres docs make an exception for the case of fractional dates, but I don't see any indication it does this in the general case

@alamb alamb changed the title Cast string to interval Add support for casting Utf8 and LargeUtf8 --> Interval Mar 4, 2023
Contributor

@alamb alamb left a comment

Thank you @doki23 -- I think this PR looks great and is ready to go. ❤️

@alamb alamb changed the title Add support for casting Utf8 and LargeUtf8 --> Interval Support for casting Utf8 and LargeUtf8 --> Interval Mar 4, 2023
mut day_part: f64,
mut nanos_part: f64,
) -> (i64, i64, f64) {
// Convert fractional month to days, It's not supported by Arrow types, but anyway
Contributor

Is this correct? It seems to assume a month has 30 days, which isn't true?

Contributor Author

According to this doc it's correct. But you remind me that maybe it's incorrect to spill smaller units to larger units.

Contributor

In particular

Field values can have fractional parts: for example, '1.5 weeks' or '01:02:03.45'. However, because interval internally stores only three integer units (months, days, microseconds), fractional units must be spilled to smaller units. Fractional parts of units greater than months are rounded to be an integer number of months, e.g. '1.5 years' becomes '1 year 6 mons'. Fractional parts of weeks and days are computed to be an integer number of days and microseconds, assuming 30 days per month and 24 hours per day, e.g., '1.75 months' becomes 1 mon 22 days 12:00:00. Only seconds will ever be shown as fractional on output.

Perhaps we could add a note, with a link?
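The spill rule quoted above can be illustrated with a small sketch. This is not the PR's code: `spill_fractions` is a hypothetical helper name, and it only assumes what the quoted PostgreSQL text states (30 days per month, 24 hours per day), with nanoseconds as the smallest unit.

```rust
/// Nanoseconds in a 24-hour day (exactly representable as an f64).
const NANOS_PER_DAY: f64 = 86_400.0 * 1_000_000_000.0;

/// Hypothetical helper: spill fractional months into days (30 days per
/// month) and fractional days into nanoseconds (24 hours per day).
/// Whole parts are never promoted to a larger unit.
fn spill_fractions(months: f64, days: f64, nanos: f64) -> (i64, i64, i64) {
    // `trunc` rounds toward zero, so the fraction keeps the input's sign.
    let whole_months = months.trunc();
    let days = days + (months - whole_months) * 30.0;

    let whole_days = days.trunc();
    let nanos = nanos + (days - whole_days) * NANOS_PER_DAY;

    (whole_months as i64, whole_days as i64, nanos as i64)
}
```

With this sketch, `spill_fractions(1.75, 0.0, 0.0)` yields 1 month, 22 days, and 12 hours' worth of nanoseconds, which matches the '1.75 months' example in the quoted documentation.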

day_part += (month_part - (month_part as i64) as f64) * 30_f64;

// Convert fractional days to hours
nanos_part += (day_part - ((day_part as i64) as f64))
Contributor

I'm not sure this truncation logic is correct for negative quantities

Contributor Author

It's OK, I've added a unit test for it.

Contributor

I can't see a test that has a negative fractional number of days or months? Am I being blind?

@@ -871,6 +1097,117 @@ mod tests {
);
}

#[test]
fn test_parse_interval() {
Contributor

I think it would be good to have a test of the negative fractional quantities e.g. -1.1 month
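To make the concern about negative quantities concrete, here is a minimal sketch (the function names are hypothetical, not from the PR) contrasting an `as i64` cast, which truncates toward zero, with `floor`. With truncation the fractional part keeps the sign of the input, so a negative fraction spills as a negative amount into the smaller unit.

```rust
/// Split `x` into (whole, fraction) with an `as i64` cast.
/// The cast truncates toward zero, so the fraction has the same sign as `x`.
fn split_trunc(x: f64) -> (i64, f64) {
    let whole = x as i64;
    (whole, x - whole as f64)
}

/// Split `x` using `floor`, so the fraction always lies in [0, 1).
fn split_floor(x: f64) -> (i64, f64) {
    let whole = x.floor();
    (whole as i64, x - whole)
}
```

For `-1.5`, `split_trunc` gives `(-1, -0.5)` while `split_floor` gives `(-2, 0.5)`. Both describe the same value, but only the first keeps every component negative, which is what a test such as `-1.1 month` would need to pin down.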

Comment on lines 548 to 549
// @todo It's better to use Decimal in order to protect rounding errors
// Wait https://github.com/apache/arrow/pull/9232
Contributor

Suggested change
- // @todo It's better to use Decimal in order to protect rounding errors
- // Wait https://github.com/apache/arrow/pull/9232
+ // TODO: Use fixed-point arithmetic to avoid truncation and rounding errors (#3809)

Contributor

@tustvold tustvold left a comment

I think other than a test of negative fractional months and days, this looks good to me. I've filed #3809 to track altering this to use fixed-point arithmetic to avoid the current issues around truncation and overflow

Edit: I think we also need to remove the logic that "promotes" from days to months
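As a rough illustration of the fixed-point idea mentioned for #3809, the sketch below parses a decimal string with integer arithmetic only, so no `f64` rounding is involved. It is an assumption-laden example rather than the approach taken in arrow-rs: `parse_fixed`, `days_to_parts`, and the 9-digit `SCALE` are all hypothetical choices.

```rust
const NANOS_PER_DAY: i128 = 86_400 * 1_000_000_000;
/// Decimal digits kept after the point (an arbitrary choice for this sketch).
const SCALE: u32 = 9;

/// Parse a decimal such as "-3.2" into an integer count of 10^-SCALE units.
/// Returns None for malformed input or more than SCALE fractional digits.
fn parse_fixed(s: &str) -> Option<i128> {
    let (negative, body) = match s.strip_prefix('-') {
        Some(rest) => (true, rest),
        None => (false, s),
    };
    let (int_str, frac_str) = body.split_once('.').unwrap_or((body, ""));
    let all_digits = |t: &str| t.bytes().all(|b| b.is_ascii_digit());
    if (int_str.is_empty() && frac_str.is_empty())
        || !all_digits(int_str)
        || !all_digits(frac_str)
        || frac_str.len() > SCALE as usize
    {
        return None;
    }
    let int: i128 = if int_str.is_empty() { 0 } else { int_str.parse().ok()? };
    let frac: i128 = if frac_str.is_empty() { 0 } else { frac_str.parse().ok()? };
    // Right-pad the fraction with zeros so "2" after the point means 0.2.
    let frac = frac * 10_i128.pow(SCALE - frac_str.len() as u32);
    let magnitude = int.checked_mul(10_i128.pow(SCALE))?.checked_add(frac)?;
    Some(if negative { -magnitude } else { magnitude })
}

/// Convert a fixed-point day count into (whole days, leftover nanoseconds).
/// Integer division truncates toward zero, so both parts share the sign.
fn days_to_parts(fixed_days: i128) -> Option<(i128, i128)> {
    let nanos = fixed_days.checked_mul(NANOS_PER_DAY)? / 10_i128.pow(SCALE);
    Some((nanos / NANOS_PER_DAY, nanos % NANOS_PER_DAY))
}
```

Under these assumptions, `"-3.2"` days becomes exactly -3 days and -17,280,000,000,000 ns (that is, -0.2 days), and overflow is reported as `None` instead of silently wrapping.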

Comment on lines 666 to 669
day_part += ((nanos_part as i64) / (NANOS_PER_DAY as i64)) as f64;
month_part += ((day_part as i64) / 30_i64) as f64;
nanos_part %= NANOS_PER_DAY;
day_part %= 30_f64;
Contributor

@tustvold tustvold Mar 6, 2023

Suggested change
- day_part += ((nanos_part as i64) / (NANOS_PER_DAY as i64)) as f64;
- month_part += ((day_part as i64) / 30_i64) as f64;
- nanos_part %= NANOS_PER_DAY;
- day_part %= 30_f64;

I would suggest removing this, not only is it potentially incorrect as months don't have a fixed number of days, but also integer division is very slow (although LLVM may be smart enough to convert this to fixed point multiplication)

Contributor

I like the idea of removing the "convert to higher units" logic and filing a ticket to (potentially) support it.

I am happy to file such a ticket

I agree that converting 40 days --> 1 month and 10 days, as this PR currently does, doesn't seem correct.
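A small sketch of why promoting days to months changes the meaning of an interval: adding 40 days to a date is generally not the same as adding 1 month and then 10 days, because the month length depends on the starting date. The helpers below are hypothetical and deliberately simplified (a single non-leap year, no day-of-month clamping, no wrap past December).

```rust
/// Days in each month of a non-leap year (used only for illustration).
const DAYS_IN_MONTH: [u32; 12] = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31];

/// Add `n` days to a (month, day) date; `month` is 1-based.
/// Simplification: the result must stay within the same year.
fn add_days(mut month: u32, mut day: u32, mut n: u32) -> (u32, u32) {
    while n > 0 {
        let remaining = DAYS_IN_MONTH[(month - 1) as usize] - day;
        if n <= remaining {
            day += n;
            n = 0;
        } else {
            // Consume the rest of this month and roll over to the 1st.
            n -= remaining + 1;
            day = 1;
            month += 1;
        }
    }
    (month, day)
}

/// Add whole months first, then days, as a month/day interval would.
fn add_interval(month: u32, day: u32, months: u32, days: u32) -> (u32, u32) {
    add_days(month + months, day, days)
}
```

Starting from January 1, `add_days(1, 1, 40)` lands on February 10, while `add_interval(1, 1, 1, 10)` lands on February 11. The two only agree when the starting month happens to have 30 days, for example April 1.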

Contributor Author

@doki23 doki23 Mar 7, 2023

I agree.

@alamb
Contributor

alamb commented Mar 6, 2023

BTW thank you for sticking with this @doki23 -- I didn't realize how much more work would be needed after a more thorough review in arrow-rs 😓

@alamb
Contributor

alamb commented Mar 6, 2023

Also, thank you @tustvold for the thorough review!


assert_eq!(
(-1i32, -18i32, (-0.2 * NANOS_PER_DAY) as i64),
parse_interval("months", "-1.5 months -3.2 days").unwrap(),
Contributor Author

@tustvold Here it is.

@doki23
Contributor Author

doki23 commented Mar 7, 2023

@alamb I'm glad to stick with this, because your reviews (yours and @tustvold's) give me some programming inspiration; it's fun for me.

@tustvold tustvold merged commit 14544fb into apache:master Mar 7, 2023
@tustvold
Contributor

tustvold commented Mar 7, 2023

Thank you for this 👍

@alamb
Contributor

alamb commented Mar 7, 2023

Thanks again @doki23 and @tustvold

MazterQyou pushed a commit to cube-js/arrow-rs that referenced this pull request Dec 5, 2023
* cast string to interval

* cast string to interval

* unit tests

* fix

* update

* code clean

* update unit tests and align_interval_parts

* fix ut

* make clippy happy

* Update arrow-cast/src/parse.rs

Co-authored-by: Raphael Taylor-Davies <1781103+tustvold@users.noreply.github.com>

* change return types of calculate_from_part and fix bug of align_interval_parts

* make clippy happy

* remote useless overflow check

* remove the "convert to higher units" logic

---------

Co-authored-by: Raphael Taylor-Davies <1781103+tustvold@users.noreply.github.com>
MazterQyou pushed a commit to cube-js/arrow-rs that referenced this pull request Dec 8, 2023
mcheshkov pushed a commit to cube-js/arrow-rs that referenced this pull request Aug 21, 2024
Can drop this after rebase on commit 14544fb "Support for casting Utf8 and LargeUtf8 --> Interval (apache#3762)", first released in 35.0.0
Successfully merging this pull request may close these issues.

Support cast <> String to interval
3 participants