-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Full support for time32 and time64 literal values (ScalarValue
)
#4156
Full support for time32 and time64 literal values (ScalarValue
)
#4156
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @andre-cc-natzka -- this looks great. I will plan to review this carefully in the morning but after a quick skim this looks very nice
cc @waitingkuo and @avantgardnerio
datafusion/common/src/scalar.rs
Outdated
@@ -2100,15 +2188,16 @@ impl TryFrom<ScalarValue> for i32 { | |||
} | |||
} | |||
|
|||
// special implementation for i64 because of TimeNanosecond | |||
// special implementation for i64 because of Date64, Time64, Time64 and Timestamp |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Seems like Time64
is repeated
// special implementation for i64 because of Date64, Time64, Time64 and Timestamp | |
// special implementation for i64 because of Date64, Time64, and Timestamp |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @isidentical!
datafusion/proto/src/to_proto.rs
Outdated
@@ -1188,7 +1188,28 @@ impl TryFrom<&ScalarValue> for protobuf::ScalarValue { | |||
}) | |||
} | |||
|
|||
datafusion::scalar::ScalarValue::Time64(v) => { | |||
// Since the protos only support Time64 and always interpret it to nanosecond accuracy, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: is there a particular reason why we can't represent these types directly in our serialization format other than backwards compatibility?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think so, in fact I would guess we wanted to represent the other types as well, i.e. protos would support Time32Second
, Time32Millisecond
, Time64Microsecond
, Time64Nanosecond
. I didn't want to change the protos before discussing them with you first. I would suggest the following changes:
enum PrimitiveScalarType{
[...]
INTERVAL_MONTHDAYNANO = 29;
BINARY = 25;
LARGE_BINARY = 26;
TIME32 = 27;
TIME64 = 28;
}
and
message ScalarValue{
oneof value {
[...]
ScalarTime32Value time32_value = 30;
ScalarTime64Value time64_value = 31;
IntervalMonthDayNanoValue interval_month_day_nano = 32;
StructValue struct_value = 33;
}
}
with the new messages:
message ScalarTime32Value {
oneof value {
int32 time32_second_value = 1;
int32 time32_millisecond_value = 2;
};
}
message ScalarTime64Value {
oneof value {
int64 time64_microsecond_value = 1;
int64 time64_nanosecond_value = 2;
};
}
Then, I would change the to_proto.rs
and from_proto.rs
files consistently. What do you think, @isidentical, @alamb, @waitingkuo, @avantgardnerio?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change looks reasonable to me. It also corresponds to arrow's definition: https://github.com/apache/arrow/blob/master/format/Schema.fbs#L214-L231
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @waynexia! I will thus push my most up-to-date version where I implement these changes, with consistent modifications in to_proto.rs
and from_proto.rs
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we have worried about backwards compatibility in serialization before -- I recommend we simply make the protobuf definitions match the changes to ScalarValue
s
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the clarification, @alamb.
Thank you for your prompt appreciation, @alamb! I am awaiting your more detailed review and willing to address your comments. For the time being there was already important point raised and I left a suggestion to change |
@andre-cc-natzka I also highly recommend rebasing against the current master branch, since there were a couple of changes in how we process primitive scalars. |
hi @andre-cc-natzka this looks great, thanks! it'll be great if we could have a least one test cases for each new data types (i.e. time64(micro), time32(millis), time32(seconds).) |
ScalarValue
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @andre-cc-natzka for this PR. I would say this PR is great and could be merged as is (once the conflicts are resolved). I have filed #4171 for supporting time / date and have linked it to this PR.
Also thank you @isidentical and @waitingkuo for the review.
I agree with @waitingkuo that we need to add some end to end sql tests before we claim full "support" for time and date types.
Basic tests I would recommend for the different time types
- Aggregate (min(time), )
- Group by time
- filter 'foo' = time
As @waitingkuo mentions, we could probably follow the model here: https://github.com/apache/arrow-datafusion/blob/master/datafusion/core/tests/sql/timestamp.rs#L1555-L1571
@@ -824,6 +854,30 @@ mod tests { | |||
Operator::Lt, | |||
DataType::Date64 | |||
); | |||
test_coercion_binary_rule!( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 nice
I double checked that this casting is supported in arrow-rs, which they are ✅
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you very much for checking this, @alamb!
@alamb @andre-cc-natzka ➜ datafusion-cli git:(Support_Time32_and_Time64) ✗ cargo run
Finished dev [unoptimized + debuginfo] target(s) in 0.48s
Running `target/debug/datafusion-cli`
DataFusion CLI v13.0.0
❯ select 86400000000000::time;
thread 'main' panicked at 'invalid time', /Users/willy/.cargo/registry/src/github.com-1ecc6299db9ec823/chrono-0.4.22/src/naive/time/mod.rs:422:67
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace ➜ datafusion-cli git:(Support_Time32_and_Time64) ✗ cargo run
Finished dev [unoptimized + debuginfo] target(s) in 0.46s
Running `target/debug/datafusion-cli`
DataFusion CLI v13.0.0
❯ select (-1)::time;
thread 'main' panicked at 'invalid time', /Users/willy/.cargo/registry/src/github.com-1ecc6299db9ec823/chrono-0.4.22/src/naive/time/mod.rs:422:67
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace |
Thanks to @alamb and @waitingkuo for your hints! I will implement the missing tests and ask you if I have questions early next week, then I'll follow @isidentical's suggestion and rebase this into the current master branch, solve conflicts and merge! |
8b8e075
to
5d147d9
Compare
So, I have added the tests required by @alamb and I rebased the branch on top of the most recent version of the master. I had a few troubles while rebasing but I could solve them and everything seems to be working fine now. I will now wait for your final revision (and approval of workflows) before merge. Thanks for your support! |
FWIW I reran those examples from @waitingkuo using datafusion-cli on this branch and they no longer panics (not sure if it is this PR or something else).
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks almost perfect @andre-cc-natzka -- thank you so much
I think the only thing it needs prior to merge is to sort out datafusion/core/src/physical_plan/hash_utils.rs
which doesn't seem right
Otherwise I think it is good to go 🚀
I would be happy to help sort out the merge conflicts if that would help.
datafusion/core/tests/sql/select.rs
Outdated
@@ -814,6 +814,8 @@ async fn query_on_string_dictionary() -> Result<()> { | |||
]; | |||
assert_batches_eq!(expected, &actual); | |||
|
|||
// filtering with Time32 and Time64 types |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this comment seems out of place
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, that one escaped me somehow
ScalarValue::Time64(Some(0)), | ||
ScalarValue::Time64(Some(i64::MAX)), | ||
ScalarValue::Time64(None), | ||
ScalarValue::Date64(Some(0)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
@@ -0,0 +1,830 @@ | |||
// Licensed to the Apache Software Foundation (ASF) under one |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand this file -- it looks like maybe it was previously moved into physical-expr and maybe a merge conflict caused it to end up here?
find . -name 'hash_utils.rs'
./datafusion/core/src/physical_plan/hash_utils.rs
./datafusion/physical-expr/src/hash_utils.rs
``
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hello @alamb! I have been looking into the merges and I think that is exactly what happened. Indeed, this file was present in version 13, but it is no longer there. I had to change /datafusion/core/src/physical_plan/joins/hash_join.rs
and I ended up modifying datafusion/core/src/physical_plan/hash_utils.rs
as well. Then I wasn't sure if this file had been eliminated or not, and while solving the merge conflicts I thought it had not been, but I was wrong.
It seems clear to me now that this file has to be removed and everything will be solved. But maybe I am missing something? I will push the new version without this file, please let me know what you think.
And thank you very much for your words and for going through my changes! We are happy to cooperate with you!
I have made the changes and pushed the new version. I would expect this to be ready to merge now, however I am a bit puzzled with the failed Rust workflow runs. There are also 101 failed tests and 340 passing tests, which is the same as the result one gets when running the most recent version of the master, so that makes me think that this is potentially ok? |
I suspect this has to do with perhaps not having submodules checked out (do the tests pass with THanks for the work here @andre-cc-natzka -- I have this PR on my radar and want to help push it over the line (aka get the CI clean and passing). I will do so but may not have a chance to do for another day or two |
Thanks for the hint, @alamb! I have done it and all tests are passing now. Actually there was one failing due to an experiment I had made and had forgotten it, now it is fixed! Thanks for helping with the merge, and no problem at all, I'll wait for you to be available. Having said that, I am looking forward to it! |
I took the liberty of merging master into this PR, resolving the conflicts, and updating a chrono usage that was deprecated. I plan to merge it once the CI checks have completed. Thanks again @andre-cc-natzka |
CI failure due to #4294 (not this PR) |
Thanks again @andre-cc-natzka |
Benchmark runs are scheduled for baseline = 0bb5d4d and contender = bea645c. bea645c is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
Thank you very much for your commitment into making this through, @alamb! I am very happy with this outcome. In the meanwhile, I've realized that type coercion is not supported for a ( As I mention there, unfortunately Arrow does not support type coercion for ( I have opened a PR with the suggested changes #4312 and if you agree we could add that feature as well. |
Which issue does this PR close?
To my knowledge, this PR does not tackle an existing issue.Re #4171
Rationale for this change
Time coercion did not support
Time
types, while it did supportDate
andTimestamp
. It would be useful to support bothTime32
andTime64
for type coercion as well, thus ensuring that, for instance, a pair composed by aUtf8
and aTime32
is coerced to aTime32
. The implementation has to take into account the validTimeUnit
for bothTime32
andTime64
and implement them consistently throughout the code. Additionally, it would also be useful to implement support forTime32
andTime64
inhash_join
andhash_utils
.What changes are included in this PR?
A series of adaptations were implemented throughout the code as to consistently support
Time32(TimeUnit::Second)
,Time32(TimeUnit::Millisecond)
,Time64(TimeUnit::Microsecond)
andTime64(TimeUnit::Nanosecond)
. The changes are mostly localized in thebinary.rs
file fromtype_coercion
, but support for time types also included numerous changes toscalar.rs
from thecommon
crate and some modules from thephysical_expr
crate.Although most changes are straightforward, some non-trivial modifications were added in the proto crate. The
datafusion.proto
file is not changed, so it keeps supporting only a genericTime64
type, with no unit specification, interpreting it to nanosecond accuracy. Thedatafusion/proto/src/to_proto.rs
anddatafusion/proto/src/from_proto.rs
files are adapted consistently, as the protoTime64
translates into aTime64Nanosecond
. On the other hand, a protoTime64
can be obtained from all fourTime64Nanosecond
,Time64Microsecond
,Time32Millisecond
andTime32Second
scalar value types, as implemented indatafusion/proto/src/to_proto.rs
.Are these changes tested?
Yes, the changes in
type_coercion
are tested in thebinary.rs
module, where the existing tests have been extended to account for the cases of type coercion involvingTime32
andTime64
types, and tests are passing.Are there any user-facing changes?
No.