Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support type coercion for timestamp and utf8 #4312

Conversation

andre-cc-natzka
Copy link
Contributor

@andre-cc-natzka andre-cc-natzka commented Nov 21, 2022

Which issue does this PR close?

Closes #4311.

Rationale for this change

Currently DataFusion supports type coercion for (DataType::Time, DataType::Utf8) and (DataType::Date, DataType::Utf8) pairs, but is still missing type coercion for (DataType::Timestamp, DataType::Utf8).

What changes are included in this PR?

Two lines of code are added in the temporal_coercion function from datafusion/expr/src/type_coercion/binary.rs to account for type coercion for a (DataType::Timestamp, DataType::Utf8) pair. The output type is DataType::Timestamp(TimeUnit::Nanosecond, _), because it is the only time unit supported by Arrow, which forces us to change the original time unit to nanoseconds. Although not ideal, this is not a problem, as both the left and right hand side of an expression are converted into DataType::Timestamp(TimeUnit::Nanosecond, _) and the expression can be properly evaluated.

Are these changes tested?

Yes, corresponding test cases are added to the test function test_type_coercion from datafusion/expr/src/type_coercion/binary.rs.

Are there any user-facing changes?

No.

@github-actions github-actions bot added core Core DataFusion crate logical-expr Logical plan and expressions labels Nov 21, 2022
Copy link
Contributor

@waitingkuo waitingkuo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you @andre-cc-natzka

here's my comments:

  1. rebase master, your time32/64 pr is merged, thanks again
  2. i'll suggest align the text with the prettyprint result (what you see in datafusion-cli)
    i.e.
    align the result for
select now()::text;
+-----------------------------------+
| now()                             |
+-----------------------------------+
| 2022-11-21 21:42:52.196874 +00:00 |
+-----------------------------------+

to

select now();
+----------------------------------+
| now()                            |
+----------------------------------+
| 2022-11-21T21:42:48.800795+00:00 |
+----------------------------------+

and

select now()::timestamp::text;
+----------------------------+
| now()                      |
+----------------------------+
| 2022-11-21 21:44:48.386022 |
+----------------------------+
1 row in set. Query took 0.003 seconds.

to

❯ select now()::timestamp;
+----------------------------+
| now()                      |
+----------------------------+
| 2022-11-21T21:44:52.195395 |
+----------------------------+
  1. delete the generated datafusion.rs in the proto folder, it's generated by the prost crate https://github.com/tokio-rs/prost

@@ -1555,7 +1555,7 @@ async fn cast_timestamp_to_timestamptz() -> Result<()> {
#[tokio::test]
async fn test_cast_to_time() -> Result<()> {
let ctx = SessionContext::new();
let sql = "SELECT 0::TIME";
let sql = "SELECT 0::TIME64";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'm not sure why do we have this. but select 0::time64 doesn't work for me

ScalarTime64Value time64_value = 30;
IntervalMonthDayNanoValue interval_month_day_nano = 31;
StructValue struct_value = 32;
ScalarFixedSizeBinary fixed_size_binary_value = 34;
>>>>>>> upstream/master
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
>>>>>>> upstream/master

Comment on lines 773 to 786
message ScalarTime32Value {
oneof value {
int32 time32_second_value = 1;
int32 time32_millisecond_value = 2;
};
}

message ScalarTime64Value {
oneof value {
int64 time64_microsecond_value = 1;
int64 time64_nanosecond_value = 2;
};
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think you should rebase master, the pr for time32/64 is merged

Comment on lines 817 to 818
// filtering with Time32 and Time64 types

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not quite sure what it is, i guess we need to delete ths

@alamb
Copy link
Contributor

alamb commented Nov 23, 2022

Marking this PR as draft as it still needs some work prior to passing CI

@alamb alamb marked this pull request as draft November 23, 2022 18:02
@github-actions github-actions bot removed the core Core DataFusion crate label Dec 5, 2022
@andre-cc-natzka
Copy link
Contributor Author

Hello @waitingkuo, @alamb. Thanks for your feedback!

It's been a long time, I apologize for that. I took some days of vacation actually. I am also sorry for the state of this PR, I should indeed have rebased master in advance. I have been through the process now and I believe the problems pointed out by @waitingkuo have been solved.

In fact, the changes I implemented here are very simple, and are limited to the ones in datafusion/expr/src/type_coercion/binary.rs, which aim at enabling type coercion to a timestamp (to nanosecond accuracy, since it is the only one supported by Arrow) when a string and a timestamp (with arbitrary precision) are provided. This could be very useful in my opinion.

Thank you again,
André.

@andre-cc-natzka andre-cc-natzka marked this pull request as ready for review December 5, 2022 09:38
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code looks good to me. Thank you @andre-cc-natzka

I would recommend (perhaps as a follow on) to add some SQL level tests but since there is coverage here I think it looks good to me

@waitingkuo
Copy link
Contributor

hi @andre-cc-natzka thank you, it looks good to me.

I suggest that add some SQL level tests

e.g.

select (timestamp '2000-01-01T00:00:00') = '2000-01-01T00:00:00';
+-----------------------------------------------------------+
| Utf8("2000-01-01T00:00:00") = Utf8("2000-01-01T00:00:00") |
+-----------------------------------------------------------+
| true                                                      |
+-----------------------------------------------------------+
1 row in set. Query took 0.002 seconds.

and this should raise error until we could cast TimestampTz to Utf8

select (timestamptz '2000-01-01T00:00:00+00:00') = '2000-01-01T00:00:00+08:00';
NotImplemented("Unsupported CAST from Utf8 to Timestamp(Nanosecond, Some(\"+00:00\"))")

@alamb
Copy link
Contributor

alamb commented Dec 7, 2022

I plan to write some SQL level tests for this feature as they will help us in IOx.

@alamb alamb merged commit 8547fd8 into apache:master Dec 7, 2022
@alamb
Copy link
Contributor

alamb commented Dec 7, 2022

Thanks again @andre-cc-natzka -- here are some SQL level tests #4545

@ursabot
Copy link

ursabot commented Dec 7, 2022

Benchmark runs are scheduled for baseline = fbadebb and contender = 8547fd8. 8547fd8 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@andre-cc-natzka
Copy link
Contributor Author

Hello again @alamb , @waitingkuo. Thank you very much for your second revision and for merging the PR!

Thank you @alamb also for adding the SQL tests that test this new functionality, they look really nice!

All the best,
André.

@alamb
Copy link
Contributor

alamb commented Dec 8, 2022

Thank you for the contribution @andre-cc-natzka

@waitingkuo
Copy link
Contributor

thank you @andre-cc-natzka

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
logical-expr Logical plan and expressions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support for type coercion for a (Timestamp, Utf8) pair
4 participants