Timestamp overflows for extreme low/high values #8336

Closed
Tracked by #8282
comphead opened this issue Nov 27, 2023 · 5 comments · Fixed by #8369

Comments

@comphead
Contributor

Describe the bug

A timestamp literal cannot be created from extreme (very low or very high) input values.

To Reproduce

❯ SELECT to_timestamp(-62125747200);
Optimizer rule 'simplify_expressions' failed
caused by
Arrow error: Compute error: Overflow happened on: -62125747200 * 1000000000

❯ select to_timestamp(1926632005177);
Optimizer rule 'simplify_expressions' failed
caused by
Arrow error: Compute error: Overflow happened on: 1926632005177 * 1000000000
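
For reference, a minimal Rust sketch (plain std, no DataFusion dependency) of the scaling reported in the error messages above: the seconds literal is multiplied by 1_000_000_000 to get nanoseconds, and both reported values overflow i64 under that multiplication.

fn main() {
    // Scale a seconds literal to nanoseconds, as in the errors above.
    for secs in [-62_125_747_200_i64, 1_926_632_005_177_i64] {
        match secs.checked_mul(1_000_000_000) {
            Some(ns) => println!("{secs} s fits in i64 as {ns} ns"),
            None => println!("{secs} s overflows i64 when scaled to nanoseconds"),
        }
    }
}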

Expected behavior

The cast should succeed. For the two queries above, PG returns

0001-04-25T00:00:00
+63022-07-16T12:59:37

respectively.

Additional context

Part of #8282

@comphead comphead added the bug Something isn't working label Nov 27, 2023
@comphead
Contributor Author

The overflow happens because DataFusion treats the underlying i64 as nanoseconds, which is not big enough for these values (see the range sketch after the options below).
Other SQL engines like PG, Spark, and DuckDB treat the i64 as microseconds.
Parquet used the deprecated INT96 data type to fit such values.

Some evident options are:

  • Treat the Timestamp i64 as microseconds, the way PG, Spark, and DuckDB do. This might involve a huge amount of work, as all timestamp conversions and functions currently expect nanoseconds.
  • Extend the Timestamp underlying value from i64 to i128, either by modifying the code or via conditional compilation.
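
A back-of-the-envelope sketch in plain Rust (no external crates) of the trade-off behind both options: an i64 stores the same number of ticks regardless of unit, so each step toward finer precision narrows the representable span of years by a factor of 1000.

fn main() {
    // Approximate span representable by an i64 counter for each timestamp unit,
    // measured in Julian years (365.25 days) on either side of the Unix epoch.
    let seconds_per_year = 365.25 * 24.0 * 3600.0;
    for (unit, ticks_per_second) in [
        ("second", 1e0),
        ("millisecond", 1e3),
        ("microsecond", 1e6),
        ("nanosecond", 1e9),
    ] {
        let years = i64::MAX as f64 / ticks_per_second / seconds_per_year;
        println!("{unit:>11}: roughly +/- {years:.0} years around 1970");
    }
}

With nanoseconds, i64 only covers roughly the years 1677 to 2262, which is why both literals in the reproduction overflow; with microseconds (the PG/Spark/DuckDB convention) they fit comfortably.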

@alamb @waitingkuo @viirya @tustvold it would be nice to hear from you.

@tustvold
Contributor

tustvold commented Nov 27, 2023

The overflow happens because DataFusion treats the underlying i64 as nanoseconds, which is not big enough.

I'm confused by this statement. DataFusion follows the Arrow data model, which has various timestamp precisions. There is no hard-coded mapping to nanoseconds; anything from seconds to nanoseconds is supported.

@comphead
Contributor Author

The overflow happens because DataFusion treats the underlying i64 as nanoseconds, which is not big enough.

I'm confused by this statement. DataFusion follows the Arrow data model, which has various timestamp precisions. There is no hard-coded mapping to nanoseconds; anything from seconds to nanoseconds is supported.

Right, but I can see there are a bunch of places hard-coded to nanoseconds: casting from string to timestamp produces nanoseconds, reading from Parquet, coercions, etc.

@alamb
Contributor

alamb commented Nov 28, 2023

Another option is "do nothing"

You can get the expected answer by using to_timestamp_seconds:

❯ select to_timestamp_seconds(-62125747200);
+-------------------------------------------+
| to_timestamp_seconds(Int64(-62125747200)) |
+-------------------------------------------+
| 0001-04-25T00:00:00                       |
+-------------------------------------------+
1 row in set. Query took 0.001 seconds.
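
The same workaround through the DataFusion Rust API, as a minimal sketch (assuming the datafusion and tokio crates; exact output formatting may vary by version):

use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();
    // to_timestamp_seconds keeps the literal in seconds, so no scaling to
    // nanoseconds (and no overflow) happens during simplification.
    ctx.sql("SELECT to_timestamp_seconds(-62125747200)")
        .await?
        .show()
        .await?;
    Ok(())
}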

I think the core issue is that the SQL type TIMESTAMP mapping (as documented in https://arrow.apache.org/datafusion/user-guide/sql/data_types.html#date-time-types) is:

SQL Type  | Arrow Type
TIMESTAMP | Timestamp(Nanosecond, None)

There is no type in Arrow that can represent nanosecond precision over the same range of values as an i64 seconds timestamp. So if we switched to something like

SQL Type  | Arrow Type
TIMESTAMP | Timestamp(Second, None)

that would result in lower-precision timestamps (we could no longer represent nanosecond precision).

So basically I suggest we do nothing about this particular issue: there is a workaround, and I think it is a fairly uncommon corner case.

@comphead
Contributor Author

Thanks @alamb @tustvold for your input. As part of this issue I will document the timestamp limits in the user manual, specifying which range is acceptable. We can return to this later if a real user scenario arises.
