Inconsistent Signedness Of Legacy Parquet Timestamps Written By Spark #7958

Closed
comphead opened this issue Oct 27, 2023 · 17 comments · Fixed by #8369

comphead (Contributor) commented Oct 27, 2023

Describe the bug

DF reads the Parquet timestamp datatype as nanoseconds from the Parquet file, whereas DuckDB and Spark treat the same datatype as seconds.

To Reproduce

create a parquet file with timestamp value -62125747200 and read it back

DuckDB and Spark read the value correctly:

0001-04-25 00:00:00

but DF reads the timestamp as nanoseconds and gives the wrong answer:

❯ select * from test;
+-------------------------------+
| a                             |
+-------------------------------+
| 1754-12-22T22:43:41.128654848 |
+-------------------------------+

Expected behavior

The behavior should be the same.

Additional context

No response

comphead added the bug (Something isn't working) label Oct 27, 2023
comphead (Contributor Author)

@waitingkuo let me know your thoughts

alamb (Contributor) commented Oct 27, 2023

create a parquet file with timestamp value -62125747200 and read it back

How does one do this? Can you provide an example of such a file or the commands needed to create such a file?

comphead (Contributor Author)

ts.snappy.zip
Attaching a parquet file.
I created it in Spark:

spark.sql("select cast(-62125747200 as timestamp)").write.parquet("out")
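
To reproduce, the Spark output directory can then be registered in datafusion-cli and queried; the table name and path below are illustrative:

CREATE EXTERNAL TABLE test STORED AS PARQUET LOCATION 'out/';
SELECT * FROM test;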

waitingkuo (Contributor)

@alamb @comphead
I'm not familiar with the Parquet parser; I assume the conversion is handled by timestamp casting rather than by the to_timestamp function.
The behaviors of casting and to_timestamp have been different since #7844 was merged:

select to_timestamp(1), arrow_typeof(to_timestamp(1));
+------------------------+--------------------------------------+
| to_timestamp(Int64(1)) | arrow_typeof(to_timestamp(Int64(1))) |
+------------------------+--------------------------------------+
| 1970-01-01T00:00:01    | Timestamp(Second, None)              |
+------------------------+--------------------------------------+
1 row in set. Query took 0.004 seconds.

❯ select 1::timestamp, arrow_typeof(1::timestamp);
+-------------------------------+-----------------------------+
| Int64(1)                      | arrow_typeof(Int64(1))      |
+-------------------------------+-----------------------------+
| 1970-01-01T00:00:00.000000001 | Timestamp(Nanosecond, None) |
+-------------------------------+-----------------------------+

comphead (Contributor Author) commented Oct 29, 2023

@waitingkuo thanks

to_timestamp(1) is consistent with PostgreSQL, DuckDB, and Spark.

The 1::timestamp syntax isn't supported by any of the engines above; is it supposed to behave the same as to_timestamp(1)?

waitingkuo (Contributor)

Timestamp in SQL maps to Timestamp(Nanosecond, tz) for now:
https://github.com/apache/arrow-datafusion/blob/9b45967edc6dba312ea223464dad3e66604d2095/datafusion/sql/src/planner.rs#L346

So far, 1::timestamp in DataFusion treats the integer 1 as 1 ns, which is the same as pyarrow:

In [8]: pa.array([1]).cast(pa.timestamp('ns'))
Out[8]: 
<pyarrow.lib.TimestampArray object at 0x11969d2e0>
[
  1970-01-01 00:00:00.000000001
]

@alamb @avantgardnerio @liukun4515 any thoughts about this?

avantgardnerio (Contributor)

Parquet has a proper Timestamp field, correct? https://learn.microsoft.com/en-us/common-data-model/sdk/parquet-to-cdm-datatype

And that datatype supports a precision field which specifies millis/micros/nanos?

So I presume this error only happens when reading an INT field and casting? If so, I'd 100% be on board with copying the behavior of these other well-known databases. Folks who stored their data properly as Timestamps shouldn't be affected. It would be a breaking change for those who just stored an INT, but I think now is the time to make that change, since datetime support is relatively new... it will only get harder to change going forward.
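
For reference, a file that uses the non-deprecated representation stores the value as an INT64 column annotated with a TIMESTAMP logical type carrying the unit; an illustrative schema (not taken from the attached file) would look roughly like this:

message schema {
  required int64 a (TIMESTAMP(MICROS,true));
}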

alamb (Contributor) commented Oct 30, 2023

If so, I'd 100% be on board copying the behavior of these other well known databases

I agree

I took a look at what ts.snappy.parquet contains:

$ parquet-tools schema -d ts.snappy.parquet
message spark_schema {
  required int96 a;
}

creator: parquet-mr version 1.10.99.7.1.7.2000-305 (build eeabcd207c4c506ebd915865772cadb9bac25837)
extra: org.apache.spark.version = 2.4.7
extra: org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"a","type":"timestamp","nullable":false,"metadata":{}}]}

file schema: spark_schema
--------------------------------------------------------------------------------
a: REQUIRED INT96 R:0 D:0

row group 1: RC:1 TS:44 OFFSET:4
--------------------------------------------------------------------------------
a:  INT96 SNAPPY DO:0 FPO:4 SZ:48/44/0.92 VC:1 ENC:BIT_PACKED,PLAIN_DICTIONARY ST:[min: 0x0000000000000000C4441A00, max: 0x0000000000000000C4441A00, num_nulls: 0]

It seems to use a different type and has extra metadata that are not present in an equivalent file created by DataFusion:

❯ select to_timestamp_seconds(-62125747200);
+-------------------------------------------+
| to_timestamp_seconds(Int64(-62125747200)) |
+-------------------------------------------+
| 0001-04-25T00:00:00                       |
+-------------------------------------------+
1 row in set. Query took 0.001 seconds.

❯ copy (select to_timestamp_seconds(-62125747200) as "a") to 'ts-df.parquet';
+-------+
| count |
+-------+
| 1     |
+-------+
1 row in set. Query took 0.024 seconds.

The field is not read as a timestamp at all 🤔

❯ select * from 'ts-df.parquet';
+--------------+
| a            |
+--------------+
| -62125747200 |
+--------------+
1 row in set. Query took 0.004 seconds.

And the metadata / type information is different from what Spark writes:

$ parquet-tools schema -d ts-df.parquet
message arrow_schema {
  required int64 a;
}

creator: datafusion version 32.0.0

file schema: arrow_schema
--------------------------------------------------------------------------------
a: REQUIRED INT64 R:0 D:0

row group 1: RC:1 TS:63 OFFSET:4
--------------------------------------------------------------------------------
a:  INT64 ZSTD DO:4 FPO:35 SZ:81/63/0.78 VC:1 ENC:RLE,PLAIN,RLE_DICTIONARY ST:[min: -62125747200, max: -62125747200, num_nulls not defined]

alamb changed the title from "to_timestamp() wrong value reading from parquet" to "Wrong timestamp type read while from parquet file created by spark" on Oct 30, 2023
comphead (Contributor Author)

Looks like the analysis shows DF has a bunch of issues with the timestamp type:

  • select cast(1 as timestamp); should return 1970-01-01T00:00:01, not 1970-01-01T00:00:00.000000001
  • 1::timestamp should be consistent with to_timestamp(1), which works correctly now
  • reading the INT96 timestamp type is not consistent with major engines (this ticket)
  • writing a timestamp via COPY doesn't generate all the necessary Parquet information

comphead self-assigned this Nov 1, 2023
comphead (Contributor Author) commented Nov 2, 2023

arrow-rs treats the INT96 Parquet type as Timestamp(Nanosecond):
https://github.com/apache/arrow-rs/blob/master/parquet/src/arrow/schema/primitive.rs#L97
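
For context, the legacy INT96 encoding packs nanoseconds-within-day and a Julian day number into 12 bytes. A minimal sketch of that decoding (an illustration, not the actual arrow-rs code), applied to the min/max statistic from the attached file, shows why the year-1 value cannot be represented in the i64 nanoseconds behind Timestamp(Nanosecond, None):

// Sketch of the legacy Impala/Spark INT96 timestamp layout: bytes 0..8 hold
// the nanoseconds within the day and bytes 8..12 hold the Julian day number,
// both little-endian.
const JULIAN_DAY_OF_EPOCH: i128 = 2_440_588; // Julian day number of 1970-01-01
const NANOS_PER_DAY: i128 = 86_400 * 1_000_000_000;

fn int96_to_nanos_since_epoch(raw: [u8; 12]) -> i128 {
    let nanos_of_day = u64::from_le_bytes(raw[0..8].try_into().unwrap()) as i128;
    let julian_day = u32::from_le_bytes(raw[8..12].try_into().unwrap()) as i128;
    (julian_day - JULIAN_DAY_OF_EPOCH) * NANOS_PER_DAY + nanos_of_day
}

fn main() {
    // The statistic from ts.snappy.parquet: nanos-of-day = 0, Julian day = 0x001A44C4
    let raw = [0, 0, 0, 0, 0, 0, 0, 0, 0xC4, 0x44, 0x1A, 0x00];
    let nanos = int96_to_nanos_since_epoch(raw);
    // Prints "-62125747200 s, fits in i64 ns: false": the value is 0001-04-25T00:00:00,
    // but it overflows the i64 nanoseconds used by Timestamp(Nanosecond, None).
    println!("{} s, fits in i64 ns: {}", nanos / 1_000_000_000, i64::try_from(nanos).is_ok());
}

An i64 nanosecond timestamp only covers roughly 1677-09-21 to 2262-04-11, so the year-1 value wraps on conversion, which is consistent with the 1754-12-22 output in the bug report.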

There is an interesting explanation of the same issue from Snowflake:
https://community.snowflake.com/s/article/TIMESTAMP-function-returns-wrong-date-time-value-from-Parquet-file

Key takeaways

  • INT96 Parquet field is deprecated https://issues.apache.org/jira/browse/PARQUET-323
  • INT96 is only used to represent nanosecond timestamps
  • Apache projects like Hive and Spark still treat the first 16 bytes incorrectly, so they return what users thought was the correct value, even though it is in fact incorrect

That is the reason for the difference. However, DuckDB also behaves like Spark. To provide compatibility we may want to introduce a config parameter in DF and treat INT96 like Spark does.

What are your thoughts? @alamb @waitingkuo @tustvold @viirya

tustvold (Contributor) commented Nov 2, 2023

I agree with the conclusion in the Snowflake document: we should follow the specification and be consistent with other compliant implementations.

As an aside, this type has been deprecated for almost a decade; why is Spark still using it...

viirya (Member) commented Nov 6, 2023

The Spark community tried to change the default Parquet timestamp type to TIMESTAMP_MICROS (https://issues.apache.org/jira/browse/SPARK-27528), but the change was later reverted back to INT96 for ecosystem compatibility (https://issues.apache.org/jira/browse/SPARK-31639).

alamb (Contributor) commented Nov 20, 2023

Added to list on #7958

edmondop (Contributor)

Can I give a shot at this?

tustvold (Contributor)

I think it would be worthwhile to write up a description of any proposed change first. It isn't necessarily clear to me how one handles this correctly, or whether we can simply follow Snowflake's example and close this as "won't fix".

Perhaps @comphead might be able to help out here?

tustvold changed the title from "Wrong timestamp type read while from parquet file created by spark" to "Inconsistent Signedness Of Legacy Parquet Timestamps Written By Spark" on Nov 22, 2023
comphead (Contributor Author)

Thanks @edmondop, please hold off on this.
I'm fixing some timestamp literal issues, and we will have a discussion on timestamp precision in DF, so this ticket will depend on the actions above.
