-
Notifications
You must be signed in to change notification settings - Fork 839
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add MONTH_DAY_NANO interval type, impl ArrowNativeType
for i128
#779
Conversation
Codecov Report
@@ Coverage Diff @@
## master #779 +/- ##
==========================================
- Coverage 82.31% 82.23% -0.08%
==========================================
Files 168 168
Lines 49031 49116 +85
==========================================
+ Hits 40360 40392 +32
- Misses 8671 8724 +53
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @b41sh !
Thank you for the contribution @b41sh !
I took a brief look at this code and it looks quite good. @jorgecarleitao / @nevi-me what do you think?
cc @ovr
parquet/src/arrow/converter.rs
Outdated
Ok(builder.finish()) | ||
} | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if it would be possible to implement some tests for this logic (showing for example that round tripping from IntervalMonthDayNanoArray
to FixedLenByteArray
and back produces an equivalent array?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, I will add some tests for it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IntervalMonthDayNanoArray
can't converted to parquet format, as Interval
in parquet is fixed 12 bytes, but IntervalMonthDayNano
is 16 bytes. I have deleted parquet related codes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @b41sh and sorry for the delay in review.
I went through this PR carefully, and I think it adds the basic plumbing for MONTH_DAY_NANO -- while there is more to be done, I think it is a good step forward.
Note that since this adds new values to the IntervalUnit
enum
s it is not "backwards compatible" in the semver sense and thus will have to wait for the next major release (arrow-6.0)
@@ -38,7 +38,7 @@ path = "src/lib.rs" | |||
[dependencies] | |||
serde = { version = "1.0", features = ["rc"] } | |||
serde_derive = "1.0" | |||
serde_json = { version = "1.0", features = ["preserve_order"] } | |||
serde_json = { version = "1.0", features = ["preserve_order", "arbitrary_precision"] } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is often much concern about adding new dependencies to arrow - however, this feature does not seem to add any new dependencies: https://github.com/serde-rs/json/blob/master/Cargo.toml#L74
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need this feature because we need to deserialize i128 numbers
https://github.com/serde-rs/json/blob/master/src/number.rs#L534
arrow/src/array/array_primitive.rs
Outdated
@@ -624,6 +625,22 @@ mod tests { | |||
assert!(arr.is_null(1)); | |||
assert_eq!(-5, arr.value(2)); | |||
assert_eq!(-5, arr.values()[2]); | |||
|
|||
// a month_day_nano interval contains months, days and nanoseconds, | |||
// but we do not yet have accessors for the values |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it worth adding some ticket explaining what type of accessors would be valuable? Namely month
, day
and nanos
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added some TODO comments, PTAL
} else { | ||
let value: u128 = array.value($row) as u128; | ||
|
||
let months_part: i32 = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it would be nice to break this logic out into accessors, as you say, rather than have the field extraction in the stringification. But that could be done as a follow on PR I think
I think it looks good to go. @jorgecarleitao / @nevi-me / @houqp do you have any interest in reviewing this PR? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great PR, @b41sh , thanks a lot for that!
Two suggestions:
- AFAIK we must activate the IPC integration test before merging to guarantee compatibility with the ecosystem, and the test should pass. This is done by removing this line.
- There is an alternative approach to this on which we use
[i32, i32, i64]
instead ofi128
as the backing container. It makes the API simpler.
AFAIK the first item is a "must" within the arrow project.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, reverting my approval based on @jorgecarleitao 's comment :) At the very minimum, we should enable the integration test.
A trick I use in arrow2 for this is to have a git patch and apply it to arrow during execution here. An alternative is to create a PR in apache/arrow and point to that PR on our CI here. Once tests pass here, merge this PR without that change here and afterwards the PR in apache/arrow. |
! [remote rejected] interval-MonthDayNano -> interval-MonthDayNano (refusing to allow a Personal Access Token to create or update workflow I don't have permission to modify the files in the workflow directory, can you help me start the integration test? @houqp @jorgecarleitao |
Any news on it? |
I don't have any news |
Just a friendly ping. Can anyone help @b41sh to finish this PR? |
uhm, I can't understand the error message: you are pushing to your own fork of the repo, so you have full control over its CI (it is under your name, right?). Could it be that you are pushing via https and using a personal token that does not have the workflow scope active (on github)? |
Sorry for the long delay, the problem is caused by my github ssh settings and I have fixed it. |
ArrowNativeType
for i128
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I verified the integration tests is running and using Rust. Nice work @b41sh !
https://github.com/apache/arrow-rs/runs/4430618418?check_suite_focus=true
2021-12-06T13:25:33.8748227Z ##########################################################
2021-12-06T13:25:33.8749321Z IPC: C++ producing, Rust consuming
2021-12-06T13:25:33.8749914Z ##########################################################
...
2021-12-06T13:25:34.1225215Z ==========================================================
2021-12-06T13:25:34.1226543Z Testing file /tmp/arrow-integration-u5ctqfpn/generated_interval_mdn.json
2021-12-06T13:25:34.1227227Z ==========================================================
2021-12-06T13:25:34.1227755Z -- Creating binary inputs
2021-12-06T13:25:34.1228284Z -- Validating file
2021-12-06T13:25:34.1228787Z -- Validating stream
2021-12-06T13:25:34.1231126Z ==========================================================
...
I also re-reviewed the code. Thank you for pushing it through
I think we can make improvements as follow ons (e.g adding the accessor for the m, d, and nano fields)
integration-testing/unskip.patch
Outdated
.skip_category('C#') | ||
- .skip_category('JS') | ||
- .skip_category('Rust'), | ||
+ .skip_category('JS'), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🎉
so is the intent that after we merge this PR into arrow-rs
we would then go upstream and apply this patch in the arrow repo?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, once this PR is merged, we can merge #11238 in the arrow repo.
|
||
// a month_day_nano interval contains months, days and nanoseconds, | ||
// but we do not yet have accessors for the values. | ||
// TODO: implement month, day, and nanos access method for month_day_nano. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 When this PR is merged, I will try and file a ticket for adding these accessors -- I think it would be a fairly good "first PR" type change for new contributors
@houqp and @jorgecarleitao are you interested in reviewing / re-reviewing this PR? |
It's is great to implement i128 as ArrowNativeType. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot for the PR and perseverance, looks great!
I left a couple of comments where I think there are semantic issues, but if this is blocking someone, go ahead.
@@ -243,6 +243,11 @@ pub fn sort_to_indices( | |||
DataType::Interval(IntervalUnit::DayTime) => { | |||
sort_primitive::<IntervalDayTimeType, _>(values, v, n, cmp, &options, limit) | |||
} | |||
DataType::Interval(IntervalUnit::MonthDayNano) => { | |||
sort_primitive::<IntervalMonthDayNanoType, _>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The i128 order relationship does not hold for months,days,nanos
. AFAIK month,days,nanos
do not have a partial order relationship.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can I write a function like sort_month_day_nanos
for it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that in general the time intervals do not have an natural order without an associated datetime pinning them to a specific time line. The conversion month,days,nanos -> seconds
is lossy because one day is not 24 hours (some days are 23 and others 25).
@@ -348,6 +348,7 @@ make_numeric_type!(Time64MicrosecondType, i64, i64x8, m64x8); | |||
make_numeric_type!(Time64NanosecondType, i64, i64x8, m64x8); | |||
make_numeric_type!(IntervalYearMonthType, i32, i32x16, m32x16); | |||
make_numeric_type!(IntervalDayTimeType, i64, i64x8, m64x8); | |||
make_numeric_type!(IntervalMonthDayNanoType, i128, i128x4, m128x4); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Semantically, the numerics of an i128
are not the same as the numerics of (months,days,nanos)
since i128 + i128 != (months, days,nanos) + (months, days,nanos)
.
The consequence of defining this type numerically here is that arithmetic kernels will accept this type, but they will yield a semantically incorrect result (e.g. i128 + i128
to sum two intervals of 1 month each).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean we can't use IntervalMonthDayNanoType
as a numeric type? But IntervalDayTimeType
is also a numeric type here, the sum of two IntervalDayTimeType
will also produce an incorrect result.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeap, that seems a bug to me.
For this PR, I propose:
Thoughts? |
Can I delete the |
It looks like a rust bug rust-lang/rust#91663 |
Filed #1022 to track CI failure in "nightly" builds |
The rust compiler thing is fixed in #1023 -- I'll try and merge to this PR |
Thanks again @b41sh -- sorry for the delay in merging |
…test for rust arrow-rs has added `MONTH_DAY_NANO` interval type in PR [#779](apache/arrow-rs#779), we need to enable integration tests for it. Closes #11238 from b41sh/rust-month_day_nano_interval Lead-authored-by: b41sh <baishen2009@gmail.com> Co-authored-by: baishen <baishen2009@gmail.com> Signed-off-by: Antoine Pitrou <antoine@python.org>
Which issue does this PR close?
Closes #732
Rationale for this change
What changes are included in this PR?
Add MONTH_DAY_NANO interval type
Are there any user-facing changes?