-
Notifications
You must be signed in to change notification settings - Fork 875
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support for casting Utf8
and LargeUtf8
--> Interval
#3762
Conversation
It seems that the parse function is too complex to get auto-vectorized. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @doki23 -- this is looking very close.
I think this PR needs a few more tests, and I suggest removing the assert_contains
but otherwise I think it looks 👌
Thank you very much
arrow-cast/src/parse.rs
Outdated
* SECONDS_PER_HOUR | ||
* NANOS_PER_SECOND; | ||
|
||
// Convert to higher units as much as possible |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@alamb Thank you for your review and I made some changes here. For instance, it converts 31 days to 31 days before, but now, it will return 1 month 1 day. I think it's more ergonomic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is correct, the number of days in a month is not fixed. The postgres docs make an exception for the case of fractional dates, but I don't see any indication it does this in the general case
Utf8
and LargeUtf8
--> Interval
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @doki23 -- I think this PR looks great and is ready to go. ❤️
Utf8
and LargeUtf8
--> Interval
Utf8
and LargeUtf8
--> Interval
mut day_part: f64, | ||
mut nanos_part: f64, | ||
) -> (i64, i64, f64) { | ||
// Convert fractional month to days, It's not supported by Arrow types, but anyway |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this correct? It seems to assume a month has 30 days, which isn't true?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to this doc it's correct. But you remind me that maybe it's incorrect to spill smaller units to larger units.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In particular
Field values can have fractional parts: for example, '1.5 weeks' or '01:02:03.45'. However, because interval internally stores only three integer units (months, days, microseconds), fractional units must be spilled to smaller units. Fractional parts of units greater than months are rounded to be an integer number of months, e.g. '1.5 years' becomes '1 year 6 mons'. Fractional parts of weeks and days are computed to be an integer number of days and microseconds, assuming 30 days per month and 24 hours per day, e.g., '1.75 months' becomes 1 mon 22 days 12:00:00. Only seconds will ever be shown as fractional on output.
Perhaps we could add a note, with a link?
day_part += (month_part - (month_part as i64) as f64) * 30_f64; | ||
|
||
// Convert fractional days to hours | ||
nanos_part += (day_part - ((day_part as i64) as f64)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure this truncation logic is correct for negative quantities
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's ok, I've added an unit test for it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't see a test that has a negative fractional number of days or months? Am I being blind?
@@ -871,6 +1097,117 @@ mod tests { | |||
); | |||
} | |||
|
|||
#[test] | |||
fn test_parse_interval() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be good to have a test of the negative fractional quantities e.g. -1.1 month
Co-authored-by: Raphael Taylor-Davies <1781103+tustvold@users.noreply.github.com>
arrow-cast/src/parse.rs
Outdated
// @todo It's better to use Decimal in order to protect rounding errors | ||
// Wait https://github.com/apache/arrow/pull/9232 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// @todo It's better to use Decimal in order to protect rounding errors | |
// Wait https://github.com/apache/arrow/pull/9232 | |
// TODO: Use fixed-point arithmetic to avoid truncation and rounding errors (#3809) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think other than a test of negative fractional months and days, this looks good to me. I've filed #3809 to track altering this to use fixed-point arithmetic to avoid the current issues around truncation and overflow
Edit: I think we also need to remove the logic that "promotes" from days to months
arrow-cast/src/parse.rs
Outdated
day_part += ((nanos_part as i64) / (NANOS_PER_DAY as i64)) as f64; | ||
month_part += ((day_part as i64) / 30_i64) as f64; | ||
nanos_part %= NANOS_PER_DAY; | ||
day_part %= 30_f64; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
day_part += ((nanos_part as i64) / (NANOS_PER_DAY as i64)) as f64; | |
month_part += ((day_part as i64) / 30_i64) as f64; | |
nanos_part %= NANOS_PER_DAY; | |
day_part %= 30_f64; |
I would suggest removing this, not only is it potentially incorrect as months don't have a fixed number of days, but also integer division is very slow (although LLVM may be smart enough to convert this to fixed point multiplication)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the idea of removing the "convert to higher units" logic and and file a ticket to (potentially) support it
I am happy to file such a ticket
I agree the idea of 40 days
--> 1 month and 10 days
that this PR will do, doesn't seem correct
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree.
BTW thank you for sticking with this @doki23 -- I didn't realize how much more work would be needed after a more thorough review in arrow-rs 😓 |
Also, thank you @tustvold for the thorough review! |
|
||
assert_eq!( | ||
(-1i32, -18i32, (-0.2 * NANOS_PER_DAY) as i64), | ||
parse_interval("months", "-1.5 months -3.2 days").unwrap(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tustvold Here it is.
Thank you for this 👍 |
* cast string to interval * cast string to interval * unit tests * fix * update * code clean * update unit tests and align_interval_parts * fix ut * make clippy happy * Update arrow-cast/src/parse.rs Co-authored-by: Raphael Taylor-Davies <1781103+tustvold@users.noreply.github.com> * change return types of calculate_from_part and fix bug of align_interval_parts * make clippy happy * remote useless overflow check * remove the "convert to higher units" logic --------- Co-authored-by: Raphael Taylor-Davies <1781103+tustvold@users.noreply.github.com>
* cast string to interval * cast string to interval * unit tests * fix * update * code clean * update unit tests and align_interval_parts * fix ut * make clippy happy * Update arrow-cast/src/parse.rs Co-authored-by: Raphael Taylor-Davies <1781103+tustvold@users.noreply.github.com> * change return types of calculate_from_part and fix bug of align_interval_parts * make clippy happy * remote useless overflow check * remove the "convert to higher units" logic --------- Co-authored-by: Raphael Taylor-Davies <1781103+tustvold@users.noreply.github.com>
* cast string to interval * cast string to interval * unit tests * fix * update * code clean * update unit tests and align_interval_parts * fix ut * make clippy happy * Update arrow-cast/src/parse.rs Co-authored-by: Raphael Taylor-Davies <1781103+tustvold@users.noreply.github.com> * change return types of calculate_from_part and fix bug of align_interval_parts * make clippy happy * remote useless overflow check * remove the "convert to higher units" logic --------- Co-authored-by: Raphael Taylor-Davies <1781103+tustvold@users.noreply.github.com> Can drop this after rebase on commit 14544fb "Support for casting Utf8 and LargeUtf8 --> Interval (apache#3762)", first released in 35.0.0
* cast string to interval * cast string to interval * unit tests * fix * update * code clean * update unit tests and align_interval_parts * fix ut * make clippy happy * Update arrow-cast/src/parse.rs Co-authored-by: Raphael Taylor-Davies <1781103+tustvold@users.noreply.github.com> * change return types of calculate_from_part and fix bug of align_interval_parts * make clippy happy * remote useless overflow check * remove the "convert to higher units" logic --------- Co-authored-by: Raphael Taylor-Davies <1781103+tustvold@users.noreply.github.com> Can drop this after rebase on commit 14544fb "Support for casting Utf8 and LargeUtf8 --> Interval (apache#3762)", first released in 35.0.0
* cast string to interval * cast string to interval * unit tests * fix * update * code clean * update unit tests and align_interval_parts * fix ut * make clippy happy * Update arrow-cast/src/parse.rs Co-authored-by: Raphael Taylor-Davies <1781103+tustvold@users.noreply.github.com> * change return types of calculate_from_part and fix bug of align_interval_parts * make clippy happy * remote useless overflow check * remove the "convert to higher units" logic --------- Co-authored-by: Raphael Taylor-Davies <1781103+tustvold@users.noreply.github.com> Can drop this after rebase on commit 14544fb "Support for casting Utf8 and LargeUtf8 --> Interval (apache#3762)", first released in 35.0.0
Which issue does this PR close?
Closes #3643.