
Support to_timestamp with 2 arguments (timestamp format) #5398

Closed
Tracked by #3148
alamb opened this issue Feb 25, 2023 · 11 comments · Fixed by #8886
Labels
enhancement New feature or request

Comments

@alamb
Contributor

alamb commented Feb 25, 2023

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
DataFusion strives to be Postgres compatible, so differences between Postgres and DataFusion are confusing.

Today, the one-argument form of to_timestamp works in DataFusion:

select to_timestamp('2014-05-11 01:07:34.779000+01:00');
+-------------------------------------------------------+
| totimestamp(Utf8("2014-05-11 01:07:34.779000+01:00")) |
+-------------------------------------------------------+
| 2014-05-11T00:07:34.779                               |
+-------------------------------------------------------+
1 row in set. Query took 0.000 seconds.

But the two-argument form does not:

select to_timestamp('2014-05-11 01:07:34.779000+01:00', 'YYYY-MM-DD HH:MI:SS.NS');
Plan("Coercion from [Utf8, Utf8] to the signature Uniform(1, [Int64, Timestamp(Nanosecond, None), Timestamp(Microsecond, None), Timestamp(Millisecond, None), Timestamp(Second, None), Utf8]) failed.")

Describe the solution you'd like

Implement the two argument form of to_timestamp as described in

https://www.postgresql.org/docs/current/functions-formatting.html

to_timestamp ( text, text ) → timestamp with time zone

Converts string to time stamp according to the given format. (See also to_timestamp(double precision) in Table 9.33.)

postgres=# select to_timestamp('2014-05-11 01:07:34.779000+01:00', 'YYYY-MM-DD HH:MI:SS.NS');
      to_timestamp
------------------------
 2014-05-11 01:07:34+00
(1 row)
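For intuition, here is a minimal sketch of what explicit-format parsing means, using Python's `strptime` as a stand-in (chrono's specifiers are strftime-style); the format string is the strftime analogue of the Postgres pattern above, not DataFusion's actual API:

```python
from datetime import datetime

# Parse with an explicit format instead of relying on a default parser.
# "%z" accepts the "+01:00" offset (Python 3.7+).
ts = datetime.strptime(
    "2014-05-11 01:07:34.779000+01:00",
    "%Y-%m-%d %H:%M:%S.%f%z",
)
print(ts.isoformat())  # 2014-05-11T01:07:34.779000+01:00
```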

Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.

Additional context
Thank you to @cannonpalms and @idclark

@alamb alamb added the enhancement New feature or request label Feb 25, 2023
@alamb alamb changed the title to_timestamp Support to_timestamp with 2 arguments (timestamp format) Feb 25, 2023
@waitingkuo
Contributor

hi @alamb

Do you think we should follow PostgreSQL's formatting syntax or chrono's? https://docs.rs/chrono/latest/chrono/format/strftime/index.html#specifiers

In your example, the input contains a timezone:

select to_timestamp('2014-05-11 01:07:34.779000+01:00', 'YYYY-MM-DD HH:MI:SS.NS');
Plan("Coercion from [Utf8, Utf8] to the signature Uniform(1, [Int64, Timestamp(Nanosecond, None), Timestamp(Microsecond, None), Timestamp(Millisecond, None), Timestamp(Second, None), Utf8]) failed.")

PostgreSQL's to_timestamp returns timestamptz but DataFusion's returns timestamp.
Do we want to align with PostgreSQL's return type as well?
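As a rough illustration of the difference between the two syntaxes, here is a hypothetical mapping of a few Postgres tokens to their approximate strftime/chrono counterparts (illustrative and not exhaustive; a real translator would need proper tokenization, not naive string replacement):

```python
# A few Postgres to_timestamp format tokens and rough strftime/chrono
# counterparts. Dict order matters: longer tokens like HH24 must be
# replaced before shorter overlapping ones.
pg_to_strftime = {
    "YYYY": "%Y",   # 4-digit year
    "HH24": "%H",   # hour of day (00-23)
    "MM":   "%m",   # month number (01-12)
    "DD":   "%d",   # day of month
    "MI":   "%M",   # minute
    "SS":   "%S",   # second
}
fmt = "YYYY-MM-DD HH24:MI:SS"
for pg, st in pg_to_strftime.items():
    fmt = fmt.replace(pg, st)
print(fmt)  # %Y-%m-%d %H:%M:%S
```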

@alamb
Contributor Author

alamb commented Feb 28, 2023

do you think we should follow postgresql formatting syntax or chrono's

🤔 that is an excellent question. I think we should ideally follow PostgreSQL, though I realize that may be infeasible given we use chrono underneath.

postgresql's to_timestamp returns timestamptz but datafusion's returns timestamp

That is also an excellent question @waitingkuo . I don't have an answer but there is more discussion here #686

@alamb
Contributor Author

alamb commented Oct 19, 2023

We changed to_timestamp to follow postgres semantics for single arguments here: #7844

@jhorstmann
Contributor

Having recently implemented this in our proprietary engine, here are my notes on Postgres compared to other pattern syntaxes:

  • Seems very exhaustive but also super complex
  • Features include years with commas, multiple variants of AM/PM indicators, roman numerals, and a complete sub-syntax for number formatting and padding
  • In my opinion much too complex to implement, we’d need to define a small subset, at which point we can no longer claim postgres compatibility

We ended up implementing a pattern syntax closer to the one supported by Java SimpleDateFormat, supporting a subset of those pattern characters. Similar syntax is supported by Presto/TrinoDB/Athena and so should be familiar to our users. My personal opinion was also that the percent-based syntax from MySQL or chrono looked a bit ugly, and does not make the width or padding of fields obvious.

Internally, however, we translated the user-facing syntax to chrono format items and used chrono::format::parse.
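A minimal sketch of that translation idea, in Python rather than Rust, with strftime specifiers standing in for chrono format items (the token table and `translate` helper are hypothetical, not the engine's actual code):

```python
from datetime import datetime

# Map a SimpleDateFormat-style pattern to strftime specifiers,
# matching the longest tokens first; unrecognized characters pass
# through as literals.
TOKENS = [("yyyy", "%Y"), ("MM", "%m"), ("dd", "%d"),
          ("HH", "%H"), ("mm", "%M"), ("ss", "%S")]

def translate(pattern: str) -> str:
    out, i = [], 0
    while i < len(pattern):
        for token, spec in TOKENS:
            if pattern.startswith(token, i):
                out.append(spec)
                i += len(token)
                break
        else:
            out.append(pattern[i])  # literal character, e.g. '-' or ':'
            i += 1
    return "".join(out)

fmt = translate("yyyy-MM-dd HH:mm:ss")
print(fmt)                                            # %Y-%m-%d %H:%M:%S
print(datetime.strptime("2014-05-11 01:07:34", fmt))  # 2014-05-11 01:07:34
```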

@alamb
Contributor Author

alamb commented Nov 21, 2023

I would really like to avoid a massive amount of custom postgres compatible format string logic in DataFusion unless there is a super compelling usecase for it. I think it will likely be a large amount of code / tests (though @jhorstmann can correct me if I am wrong)

I would instead prefer to use the chrono format strings directly (via chrono::format::parse) perhaps

@Omega359
Contributor

When to_date/to_timestamp/etc. is used from Spark, the format is only passingly similar to Java's SimpleDateFormat, with enough differences to be annoyingly incompatible except for simple formats ('EEE' is unsupported without additional work, as are w/W, Y, O, etc.). I would not generally recommend the Java format.

@Omega359
Contributor

Omega359 commented Jan 13, 2024

Question on implementation detail while implementing this: should this throw an Err if the timestamp cannot be parsed, given either the default behaviour or any provided formats? Current behaviour seems to be to throw an Err, which mirrors postgresql behaviour, whereas with spark for example the behaviour is to return null for any unparseable string.

I'm asking because with date/timestamp string parsing you are often not dealing with machine generated data but rather human generated and there are only so many formats that can be tried before you must just set the data to null and move on. Not having that ability could be a hindrance for some use cases.

A thought I did have would be to change the behaviour via flag that would be documented as part of the api and user guide.
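A sketch of that Spark-like fall-back-to-null behaviour, using Python's `strptime` as a stand-in (the `try_to_timestamp` helper and candidate formats are illustrative, not DataFusion's API):

```python
from datetime import datetime
from typing import Optional

# Try each candidate format in turn; return None instead of raising
# when nothing matches (Spark-like null, not a Postgres-like error).
def try_to_timestamp(s: str, formats: list[str]) -> Optional[datetime]:
    for fmt in formats:
        try:
            return datetime.strptime(s, fmt)
        except ValueError:
            continue
    return None

print(try_to_timestamp("11/05/2014", ["%Y-%m-%d", "%d/%m/%Y"]))  # 2014-05-11 00:00:00
print(try_to_timestamp("not a date", ["%Y-%m-%d", "%d/%m/%Y"]))  # None
```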

@alamb
Contributor Author

alamb commented Jan 14, 2024

Question on implementation detail while implementing this: should this throw an Err if the timestamp cannot be parsed, given either the default behaviour or any provided formats? Current behaviour seems to be to throw an Err, which mirrors postgresql behaviour, whereas with spark for example the behaviour is to return null for any unparseable string.

I think we should keep the existing behavior

A thought I did have would be to change the behaviour via flag that would be documented as part of the api and user guide.

I agree a flag (or maybe another function) could be appropriate

The high-level observation is that different SQL implementations have different semantics (Decimal handling is another thing that spark seems to do differently). Given that DataFusion's built-in functions have only one version, we can't mirror both systems.

My long term hope / plan is to pull as many functions out of the core as possible (e.g. #8045) so that people can more easily customize the behavior. For example, we could have a to_timestamp that mirrors postgres and a to_timestamp that mirrors spark, and people could choose which one to use for a particular SessionContext

@tustvold
Contributor

whereas with spark for example the behaviour is to return null for any unparseable string

We support this in arrow-rs - https://docs.rs/arrow-cast/latest/arrow_cast/cast/struct.CastOptions.html#structfield.safe

@Omega359
Contributor

Thanks for your feedback @alamb! Next question: As best I can tell, it's not possible to overload a function to accept both a single string and a list of strings (no varargs).

This means, to me at least, that the choice is either a break in the dataframe api (from to_timestamp(col("a")) to to_timestamp(vec![col("a")])) or a new function name. Is an api break acceptable here, or is there a way around this?

@alamb
Contributor Author

alamb commented Jan 16, 2024

Thanks for your feedback @alamb! Next question: As best as I can tell it's not possible to overload a function to both accept a single string as well as accept a list of strings (no varargs).

I think you could do it with TypeSignature::one_of

Omega359 added a commit to Omega359/arrow-datafusion that referenced this issue Jan 16, 2024
Omega359 added a commit to Omega359/arrow-datafusion that referenced this issue Jan 16, 2024
alamb added a commit that referenced this issue Jan 20, 2024
* Support to_timestamp with chrono formatting #5398

* Updated user guide's to_timestamp to include chrono formatting information #5398

* Minor comment update.

* Small documentation updates for to_timestamp functions.

* Cargo fmt and clippy improvements.

* Switched to assert and unwrap_err based on feedback

* Fixed assert, code compiles and runs as expected now.

* Fix fmt (again).

* Add additional to_timestamp tests covering usage with tables with and without valid formats.

* to_timestamp documentation fixes.

* - Changed internal_err! -> exec_err! for unsupported data type errors.
- Extracted out to_timestamp_impl method to reduce code duplication as per PR feedback.
- Extracted out validate_to_timestamp_data_types to reduce code duplication as per PR feedback.
- Added additional tests for argument validation and invalid arguments.
- Removed unnecessary shim function 'string_to_timestamp_nanos_with_format_shim'

* Resolved merge conflict, updated toStringXXX methods to reflect upstream change

* prettier

* Fix clippy

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>