Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Full support for time32 and time64 literal values (ScalarValue) #4156

Merged
merged 14 commits into from
Nov 20, 2022

Conversation

andre-cc-natzka
Copy link
Contributor

@andre-cc-natzka andre-cc-natzka commented Nov 9, 2022

Which issue does this PR close?

To my knowledge, this PR does not tackle an existing issue.
Re #4171

Rationale for this change

Time coercion did not support Time types, while it did support Date and Timestamp. It would be useful to support both Time32 and Time64 for type coercion as well, thus ensuring that, for instance, a pair composed by a Utf8 and a Time32 is coerced to a Time32. The implementation has to take into account the valid TimeUnit for both Time32 and Time64 and implement them consistently throughout the code. Additionally, it would also be useful to implement support for Time32 and Time64 in hash_join and hash_utils.

What changes are included in this PR?

A series of adaptations were implemented throughout the code as to consistently support Time32(TimeUnit::Second), Time32(TimeUnit::Millisecond), Time64(TimeUnit::Microsecond) and Time64(TimeUnit::Nanosecond). The changes are mostly localized in the binary.rs file from type_coercion, but support for time types also included numerous changes to scalar.rs from the common crate and some modules from the physical_expr crate.

Although most changes are straightforward, some non-trivial modifications were added in the proto crate. The datafusion.proto file is not changed, so it keeps supporting only a generic Time64 type, with no unit specification, interpreting it to nanosecond accuracy. The datafusion/proto/src/to_proto.rs and datafusion/proto/src/from_proto.rs files are adapted consistently, as the proto Time64 translates into a Time64Nanosecond. On the other hand, a proto Time64 can be obtained from all four Time64Nanosecond, Time64Microsecond, Time32Millisecond and Time32Second scalar value types, as implemented in datafusion/proto/src/to_proto.rs.

Are these changes tested?

Yes, the changes in type_coercion are tested in the binary.rs module, where the existing tests have been extended to account for the cases of type coercion involving Time32 and Time64 types, and tests are passing.

Are there any user-facing changes?

No.

@github-actions github-actions bot added core Core DataFusion crate logical-expr Logical plan and expressions physical-expr Changes to the physical-expr crates labels Nov 9, 2022
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @andre-cc-natzka -- this looks great. I will plan to review this carefully in the morning but after a quick skim this looks very nice

cc @waitingkuo and @avantgardnerio

@@ -2100,15 +2188,16 @@ impl TryFrom<ScalarValue> for i32 {
}
}

// special implementation for i64 because of TimeNanosecond
// special implementation for i64 because of Date64, Time64, Time64 and Timestamp
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Seems like Time64 is repeated

Suggested change
// special implementation for i64 because of Date64, Time64, Time64 and Timestamp
// special implementation for i64 because of Date64, Time64, and Timestamp

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @isidentical!

@@ -1188,7 +1188,28 @@ impl TryFrom<&ScalarValue> for protobuf::ScalarValue {
})
}

datafusion::scalar::ScalarValue::Time64(v) => {
// Since the protos only support Time64 and always interpret it to nanosecond accuracy,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: is there a particular reason why we can't represent these types directly in our serialization format other than backwards compatibility?

Copy link
Contributor Author

@andre-cc-natzka andre-cc-natzka Nov 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so, in fact I would guess we wanted to represent the other types as well, i.e. protos would support Time32Second, Time32Millisecond, Time64Microsecond, Time64Nanosecond. I didn't want to change the protos before discussing them with you first. I would suggest the following changes:

enum PrimitiveScalarType{

    [...]
    INTERVAL_MONTHDAYNANO = 29;

    BINARY = 25;
    LARGE_BINARY = 26;

    TIME32 = 27;
    TIME64 = 28;
}

and

message ScalarValue{
    oneof value {

        [...]

        ScalarTime32Value time32_value = 30;
        ScalarTime64Value time64_value = 31;
        IntervalMonthDayNanoValue interval_month_day_nano = 32;
        StructValue struct_value = 33;
    }
}

with the new messages:

message ScalarTime32Value {
  oneof value {
    int32 time32_second_value = 1;
    int32 time32_millisecond_value = 2;
  };
}

message ScalarTime64Value {
  oneof value {
    int64 time64_microsecond_value = 1;
    int64 time64_nanosecond_value = 2;
  };
}

Then, I would change the to_proto.rs and from_proto.rs files consistently. What do you think, @isidentical, @alamb, @waitingkuo, @avantgardnerio?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change looks reasonable to me. It also corresponds to arrow's definition: https://github.com/apache/arrow/blob/master/format/Schema.fbs#L214-L231

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @waynexia! I will thus push my most up-to-date version where I implement these changes, with consistent modifications in to_proto.rs and from_proto.rs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we have worried about backwards compatibility in serialization before -- I recommend we simply make the protobuf definitions match the changes to ScalarValues

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the clarification, @alamb.

@andre-cc-natzka
Copy link
Contributor Author

Thank you @andre-cc-natzka -- this looks great. I will plan to review this carefully in the morning but after a quick skim this looks very nice

cc @waitingkuo and @avantgardnerio

Thank you for your prompt appreciation, @alamb! I am awaiting your more detailed review and willing to address your comments. For the time being there was already important point raised and I left a suggestion to change datafusion.proto in the answer.

@isidentical
Copy link
Contributor

@andre-cc-natzka I also highly recommend rebasing against the current master branch, since there were a couple of changes in how we process primitive scalars.

@waitingkuo
Copy link
Contributor

waitingkuo commented Nov 10, 2022

hi @andre-cc-natzka this looks great, thanks!

it'll be great if we could have a least one test cases for each new data types (i.e. time64(micro), time32(millis), time32(seconds).)
right now we only have it for time64(nanos)

https://github.com/apache/arrow-datafusion/blob/master/datafusion/core/tests/sql/timestamp.rs#L1555-L1571

@alamb alamb changed the title Support time32 and time64 Full support for time32 and time64 literal values (ScalarValue) Nov 10, 2022
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @andre-cc-natzka for this PR. I would say this PR is great and could be merged as is (once the conflicts are resolved). I have filed #4171 for supporting time / date and have linked it to this PR.

Also thank you @isidentical and @waitingkuo for the review.

I agree with @waitingkuo that we need to add some end to end sql tests before we claim full "support" for time and date types.

Basic tests I would recommend for the different time types

  1. Aggregate (min(time), )
  2. Group by time
  3. filter 'foo' = time

As @waitingkuo mentions, we could probably follow the model here: https://github.com/apache/arrow-datafusion/blob/master/datafusion/core/tests/sql/timestamp.rs#L1555-L1571

@@ -824,6 +854,30 @@ mod tests {
Operator::Lt,
DataType::Date64
);
test_coercion_binary_rule!(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 nice

I double checked that this casting is supported in arrow-rs, which they are ✅

https://github.com/apache/arrow-rs/blob/e44cb5b478257751916d1674292123eaf5d80a7c/arrow-cast/src/cast.rs#L149-L158

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much for checking this, @alamb!

@waitingkuo
Copy link
Contributor

@alamb @andre-cc-natzka
out of bound value make the casting crash, i think it's worth fixing as well. Perhaps another follow on issue/pr to fix

➜  datafusion-cli git:(Support_Time32_and_Time64) ✗ cargo run
    Finished dev [unoptimized + debuginfo] target(s) in 0.48s
     Running `target/debug/datafusion-cli`
DataFusion CLI v13.0.0
❯ select 86400000000000::time;
thread 'main' panicked at 'invalid time', /Users/willy/.cargo/registry/src/github.com-1ecc6299db9ec823/chrono-0.4.22/src/naive/time/mod.rs:422:67
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
➜  datafusion-cli git:(Support_Time32_and_Time64) ✗ cargo run
    Finished dev [unoptimized + debuginfo] target(s) in 0.46s
     Running `target/debug/datafusion-cli`
DataFusion CLI v13.0.0
❯ select (-1)::time;
thread 'main' panicked at 'invalid time', /Users/willy/.cargo/registry/src/github.com-1ecc6299db9ec823/chrono-0.4.22/src/naive/time/mod.rs:422:67
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

@andre-cc-natzka
Copy link
Contributor Author

Thanks to @alamb and @waitingkuo for your hints! I will implement the missing tests and ask you if I have questions early next week, then I'll follow @isidentical's suggestion and rebase this into the current master branch, solve conflicts and merge!

@andre-cc-natzka andre-cc-natzka force-pushed the Support_Time32_and_Time64 branch from 8b8e075 to 5d147d9 Compare November 14, 2022 23:01
@andre-cc-natzka
Copy link
Contributor Author

So, I have added the tests required by @alamb and I rebased the branch on top of the most recent version of the master. I had a few troubles while rebasing but I could solve them and everything seems to be working fine now. I will now wait for your final revision (and approval of workflows) before merge. Thanks for your support!

@alamb
Copy link
Contributor

alamb commented Nov 15, 2022

out of bound value make the casting crash, i think it's worth fixing as well. Perhaps another follow on issue/pr to fix

FWIW I reran those examples from @waitingkuo using datafusion-cli on this branch and they no longer panics (not sure if it is this PR or something else).

cd datafusion-cli
cargo run datafusion-cli
...

DataFusion CLI v14.0.0
❯ select 86400000000000::time;
+-----------------------+
| Int64(86400000000000) |
+-----------------------+
| ERROR CONVERTING DATE |
+-----------------------+
1 row in set. Query took 0.041 seconds.

❯ select (-1)::time;
+-----------------------+
| Int64(-1)             |
+-----------------------+
| ERROR CONVERTING DATE |
+-----------------------+
1 row in set. Query took 0.002 seconds.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks almost perfect @andre-cc-natzka -- thank you so much

I think the only thing it needs prior to merge is to sort out datafusion/core/src/physical_plan/hash_utils.rs which doesn't seem right

Otherwise I think it is good to go 🚀

I would be happy to help sort out the merge conflicts if that would help.

@@ -814,6 +814,8 @@ async fn query_on_string_dictionary() -> Result<()> {
];
assert_batches_eq!(expected, &actual);

// filtering with Time32 and Time64 types
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this comment seems out of place

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, that one escaped me somehow

ScalarValue::Time64(Some(0)),
ScalarValue::Time64(Some(i64::MAX)),
ScalarValue::Time64(None),
ScalarValue::Date64(Some(0)),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@@ -0,0 +1,830 @@
// Licensed to the Apache Software Foundation (ASF) under one
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this file -- it looks like maybe it was previously moved into physical-expr and maybe a merge conflict caused it to end up here?

find . -name 'hash_utils.rs'
./datafusion/core/src/physical_plan/hash_utils.rs
./datafusion/physical-expr/src/hash_utils.rs
``

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello @alamb! I have been looking into the merges and I think that is exactly what happened. Indeed, this file was present in version 13, but it is no longer there. I had to change /datafusion/core/src/physical_plan/joins/hash_join.rs and I ended up modifying datafusion/core/src/physical_plan/hash_utils.rs as well. Then I wasn't sure if this file had been eliminated or not, and while solving the merge conflicts I thought it had not been, but I was wrong.

It seems clear to me now that this file has to be removed and everything will be solved. But maybe I am missing something? I will push the new version without this file, please let me know what you think.

And thank you very much for your words and for going through my changes! We are happy to cooperate with you!

@andre-cc-natzka
Copy link
Contributor Author

I have made the changes and pushed the new version. I would expect this to be ready to merge now, however I am a bit puzzled with the failed Rust workflow runs. There are also 101 failed tests and 340 passing tests, which is the same as the result one gets when running the most recent version of the master, so that makes me think that this is potentially ok?

@alamb
Copy link
Contributor

alamb commented Nov 16, 2022

There are also 101 failed tests and 340 passing tests, which is the same as the result one gets when running the most recent version of the master, so that makes me think that this is potentially ok?

I suspect this has to do with perhaps not having submodules checked out (do the tests pass with git submodule update --init?

THanks for the work here @andre-cc-natzka -- I have this PR on my radar and want to help push it over the line (aka get the CI clean and passing). I will do so but may not have a chance to do for another day or two

@andre-cc-natzka
Copy link
Contributor Author

Thanks for the hint, @alamb! I have done it and all tests are passing now. Actually there was one failing due to an experiment I had made and had forgotten it, now it is fixed!

Thanks for helping with the merge, and no problem at all, I'll wait for you to be available. Having said that, I am looking forward to it!

@alamb
Copy link
Contributor

alamb commented Nov 20, 2022

I took the liberty of merging master into this PR, resolving the conflicts, and updating a chrono usage that was deprecated. I plan to merge it once the CI checks have completed.

Thanks again @andre-cc-natzka

@alamb
Copy link
Contributor

alamb commented Nov 20, 2022

CI failure due to #4294 (not this PR)

@alamb alamb merged commit bea645c into apache:master Nov 20, 2022
@alamb
Copy link
Contributor

alamb commented Nov 20, 2022

Thanks again @andre-cc-natzka

@ursabot
Copy link

ursabot commented Nov 20, 2022

Benchmark runs are scheduled for baseline = 0bb5d4d and contender = bea645c. bea645c is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@andre-cc-natzka
Copy link
Contributor Author

andre-cc-natzka commented Nov 21, 2022

Thank you very much for your commitment into making this through, @alamb! I am very happy with this outcome.

In the meanwhile, I've realized that type coercion is not supported for a (Timestamp, Utf8) pair, which in my opinion would be very interesting and relevant. I looked a bit into this and have a suggestion to add this feature, so I took the freedom to create a new issue #4311 to address it.

As I mention there, unfortunately Arrow does not support type coercion for (Timestamp, Utf8) if the time unit considered is other than nanoseconds, so the output type for type coercion between a string and a timestamp has to be a timestamp to nanosecond accuracy, no matter the original time unit.

I have opened a PR with the suggested changes #4312 and if you agree we could add that feature as well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate logical-expr Logical plan and expressions physical-expr Changes to the physical-expr crates
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants