Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

support to_timestamp with optional chrono formats #8886

Merged
merged 19 commits into from
Jan 20, 2024

Conversation

Omega359
Copy link
Contributor

Which issue does this PR close?

Closes 5398.

Rationale for this change

Adding flexibility for parsing timestamp strings

What changes are included in this PR?

Code, tests and user guide updates.

Are these changes tested?

Tests were added and pass.

Are there any user-facing changes?

dataframe.to_timestamp(..) changed from to_timestamp(col("a")) to to_timestamp(vec![col("a")])

@github-actions github-actions bot added logical-expr Logical plan and expressions physical-expr Physical Expressions sqllogictest SQL Logic Tests (.slt) labels Jan 16, 2024
@Omega359 Omega359 changed the title 5398 timestamp with formats suppor to_timestamp with optional chrono formats Jan 16, 2024
@Omega359 Omega359 changed the title suppor to_timestamp with optional chrono formats support to_timestamp with optional chrono formats Jan 16, 2024
Copy link
Contributor

@comphead comphead left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Omega359 I'll go through it tomorrow if no one else beats me


# verify timestamp data with formatting options
query PPPPPP
SELECT to_timestamp(null, '%+'), to_timestamp(0, '%s'), to_timestamp(1926632005, '%s'), to_timestamp(1, '%+', '%s'), to_timestamp(-1, '%c', '%+', '%s'), to_timestamp(0-1, '%c', '%+', '%s')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alamb as you are the ticket initiator, is it a valid query? I checked https://www.postgresql.org/docs/current/functions-formatting.html I dont see examples like that

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These queries are just mirrors of the existing tests that were there for to_timestamp without formatting options.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the key observation here is that this PR implements different semantics than any existing to_timestamp (it isn't postgres format strings, nor is it spark format strings, it is something datafusion specific based on the rust chrono format strings)

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @Omega359 -- I started going through this PR and it looks quite nice 👌

I haven't completed by review of datafusion/physical-expr/src/datetime_expressions.rs yet but I plan to do so later today or early tomorrow.

cc @waitingkuo @jhorstmann and @waynexia / @gruuya for your comments

use datafusion::prelude::*;
use datafusion_common::assert_contains;

/// This example demonstrates how to use the to_timestamp function in the DataFrame API as well as via sql.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really nice 👌

datafusion/expr/src/expr_fn.rs Outdated Show resolved Hide resolved
datafusion/expr/src/built_in_function.rs Show resolved Hide resolved

# verify timestamp data with formatting options
query PPPPPP
SELECT to_timestamp(null, '%+'), to_timestamp(0, '%s'), to_timestamp(1926632005, '%s'), to_timestamp(1, '%+', '%s'), to_timestamp(-1, '%c', '%+', '%s'), to_timestamp(0-1, '%c', '%+', '%s')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the key observation here is that this PR implements different semantics than any existing to_timestamp (it isn't postgres format strings, nor is it spark format strings, it is something datafusion specific based on the rust chrono format strings)

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great -- thank you @Omega359 🙏

I think we just need some additional test coverage and this PR is ready to go from my perspective

Copy link
Contributor

@gruuya gruuya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks extremely useful, thanks!

datafusion/physical-expr/src/datetime_expressions.rs Outdated Show resolved Hide resolved
datafusion/physical-expr/src/datetime_expressions.rs Outdated Show resolved Hide resolved
Copy link
Contributor

@comphead comphead left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @alamb I would probably stick to PG standard as we declared. The PR implementation is really cool, and we may want to have a separate function for it? like from_chrono or something

@Omega359
Copy link
Contributor Author

Thanks @alamb I would probably stick to PG standard as we declared. The PR implementation is really cool, and we may want to have a separate function for it? like from_chrono or something

I had a thought for the future to allow for setting the format based on either config or based on dialect. I was thinking that might be best left to after the function refactor (#8045) is done then have separate implementations if desired. I just didn't want to write a parser for the PG syntax as it's pretty detailed and just wanted to get something out there. Someone else noted the complexity of implementing the PG syntax

Do you have happen to have such a parser available?

- Extracted out to_timestamp_impl method to reduce code duplication as per PR feedback.
- Extracted out validate_to_timestamp_data_types to reduce code duplication as per PR feedback.
- Added additional tests for argument validation and invalid arguments.
- Removed unnecessary shim function 'string_to_timestamp_nanos_with_format_shim'
@comphead
Copy link
Contributor

Thanks @alamb I would probably stick to PG standard as we declared. The PR implementation is really cool, and we may want to have a separate function for it? like from_chrono or something

I had a thought for the future to allow for setting the format based on either config or based on dialect. I was thinking that might be best left to after the function refactor (#8045) is done then have separate implementations if desired. I just didn't want to write a parser for the PG syntax as it's pretty detailed and just wanted to get something out there. Someone else noted the complexity of implementing the PG syntax

Do you have happen to have such a parser available?

I'm not sure if its available, but we prob use your approach to create an adapter to PG formatting? 🤔
So you map PG formatting rules into chrono formatting rules

@Omega359
Copy link
Contributor Author

Omega359 commented Jan 19, 2024 via email

@comphead
Copy link
Contributor

comphead commented Jan 19, 2024

I'll leave the final decision to @alamb although he already approved.
My point it would be unexpected to users to have to_timestamp behaving differently comparing to PG. I prefer a separate method for from_chrono conversions, but if we decide to go with current approach we need to provide more documentation and examples for users in scalar_functions.md. Ideally even to have a separate doc page with examples

@alamb
Copy link
Contributor

alamb commented Jan 19, 2024

I took the liberty of merging up from main and running prettier to get CI clean.

@alamb
Copy link
Contributor

alamb commented Jan 19, 2024

Thank you again @Omega359 @comphead and @gruuya

I'll leave the final decision to @alamb although he already approved. My point it would be unexpected to users to have to_timestamp behaving differently comparing to PG. I prefer a separate method for from_chrono conversions,

I agree in an ideal world, to_timestamp formatting would behave identically to an existing implementation (e.g. spark or postgres). However, as has been noted this is a non trivial amount of work. I filed #8915 to track this item

but if we decide to go with current approach we need to provide more documentation and examples for users in scalar_functions.md. Ideally even to have a separate doc page with examples

I agree this would be really nice to add

All in all I think this functionality is important enough to provide users ways to parse timestamps with alternate formatting, even if the supported format strings aren't the ideal form.

@Omega359
Copy link
Contributor Author

... if we decide to go with current approach we need to provide more documentation and examples for users in scalar_functions.md. Ideally even to have a separate doc page with examples

I did add a fairly extensive example sql however unless a user knows where to look they won't find see it. Speaking from experience adding more extensive documentation for functions would be very helpful for someone starting up with datafusion (& rust). I'll be happy to update the docs with a brief example and a link to the example sql - perhaps that could be a pattern for all functions?

@alamb
Copy link
Contributor

alamb commented Jan 20, 2024

I'll be happy to update the docs with a brief example and a link to the example sql - perhaps that could be a pattern for all functions?

That would be amazing @Omega359 -- and really helpful. I have found that when there are good existing patterns subsequent contributors follow them, but getting the initial pattern set is the tough part.

@alamb alamb merged commit e7c0482 into apache:main Jan 20, 2024
23 checks passed
@alamb
Copy link
Contributor

alamb commented Jan 20, 2024

Thanks again @Omega359 and @comphead

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
logical-expr Logical plan and expressions physical-expr Physical Expressions sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support to_timestamp with 2 arguments (timestamp format)
4 participants