Inconsistent Signedness Of Legacy Parquet Timestamps Written By Spark #7958
@waitingkuo let me know your thoughts
How does one do this? Can you provide an example of such a file or the commands needed to create such a file?
ts.snappy.zip
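For anyone who wants to reproduce such a file without a Spark cluster, here is a minimal sketch using pyarrow, which can emulate Spark's legacy behavior via its deprecated INT96 flag (the file name is illustrative):

import datetime
import pyarrow as pa
import pyarrow.parquet as pq

# 0001-04-25T00:00:00 is -62125747200 seconds before the Unix epoch,
# the value discussed in this issue; microsecond unit keeps it in range.
table = pa.table({"ts": pa.array([datetime.datetime(1, 4, 25)], pa.timestamp("us"))})
pq.write_table(
    table,
    "ts-int96.parquet",
    compression="snappy",
    use_deprecated_int96_timestamps=True,  # store as INT96, like legacy Spark
)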
@alamb @comphead

❯ select to_timestamp(1), arrow_typeof(to_timestamp(1));
+------------------------+--------------------------------------+
| to_timestamp(Int64(1)) | arrow_typeof(to_timestamp(Int64(1))) |
+------------------------+--------------------------------------+
| 1970-01-01T00:00:01    | Timestamp(Second, None)              |
+------------------------+--------------------------------------+
1 row in set. Query took 0.004 seconds.

❯ select 1::timestamp, arrow_typeof(1::timestamp);
+-------------------------------+-----------------------------+
| Int64(1)                      | arrow_typeof(Int64(1))      |
+-------------------------------+-----------------------------+
| 1970-01-01T00:00:00.000000001 | Timestamp(Nanosecond, None) |
+-------------------------------+-----------------------------+
@waitingkuo thanks
So far:

In [8]: pa.array([1]).cast(pa.timestamp('ns'))
Out[8]:
<pyarrow.lib.TimestampArray object at 0x11969d2e0>
[
  1970-01-01 00:00:00.000000001
]

@alamb @avantgardnerio @liukun4515 any thoughts about this?
Parquet has a proper Timestamp field, correct? https://learn.microsoft.com/en-us/common-data-model/sdk/parquet-to-cdm-datatype And that datatype supports a time unit. So I presume this error only happens when reading an INT96 column?
I agree. I took a look at what Spark wrote.

It seems to use a different type and has extra metadata that is not present in an equivalent file created by datafusion.

The field is not read as a timestamp at all 🤔

And the metadata / type information is different than Spark's:

$ parquet-tools schema -d ts-df.parquet
message arrow_schema {
required int64 a;
}
creator: datafusion version 32.0.0
file schema: arrow_schema
--------------------------------------------------------------------------------
a: REQUIRED INT64 R:0 D:0
row group 1: RC:1 TS:63 OFFSET:4
--------------------------------------------------------------------------------
a: INT64 ZSTD DO:4 FPO:35 SZ:81/63/0.78 VC:1 ENC:RLE,PLAIN,RLE_DICTIONARY ST:[min: -62125747200, max: -62125747200, num_nulls not defined]
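To compare the two files' physical types and logical annotations without parquet-tools, a sketch with pyarrow (file names assumed from this thread):

import pyarrow.parquet as pq

for path in ["ts.snappy.parquet", "ts-df.parquet"]:
    f = pq.ParquetFile(path)
    print(path)
    print(f.schema)        # parquet-level schema: INT96 vs plain INT64
    print(f.schema_arrow)  # how the reader maps it back to Arrow types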
Looks like the analysis shows DF has a bunch of issues with the timestamp type: to_timestamp() behavior and wrong values when reading from parquet.
Interesting explanation from Snowflake of the same issue; its key takeaways explain the reason for the difference. However, DuckDB also behaves like Spark here. To provide compatibility support, we may want to introduce a config param in DF and treat INT96 like Spark does. What are your thoughts? @alamb @waitingkuo @tustvold @viirya
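For reference, the legacy INT96 layout stores the nanoseconds elapsed within the day in the low 8 bytes and a Julian day number in the high 4 bytes, so a Spark-compatible mode would essentially have to do this conversion. A minimal decoding sketch (function name hypothetical):

import struct

JULIAN_DAY_OF_UNIX_EPOCH = 2440588  # Julian day number of 1970-01-01

def int96_to_unix_nanos(raw: bytes) -> int:
    # raw is the 12-byte little-endian INT96 value from the parquet page:
    # bytes 0..7  = nanoseconds within the day
    # bytes 8..11 = Julian day number
    nanos_of_day, julian_day = struct.unpack("<qi", raw)
    days_since_epoch = julian_day - JULIAN_DAY_OF_UNIX_EPOCH
    return days_since_epoch * 86_400_000_000_000 + nanos_of_day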
I agree with the conclusion in the Snowflake document: we should follow the specification and be consistent with other compliant implementations. As an aside, this type has been deprecated for almost a decade; why is Spark still using it...
The Spark community tried (https://issues.apache.org/jira/browse/SPARK-27528) to change the default Parquet timestamp type to TIMESTAMP_MICROS, but the change was later reverted back to INT96 for ecosystem compatibility (https://issues.apache.org/jira/browse/SPARK-31639).
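For what it's worth, Spark writers can already opt out of INT96 per job via that config. A sketch with PySpark (assuming a local session; the output path is illustrative):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")
    .getOrCreate()
)
spark.sql("select timestamp'0001-04-25 00:00:00' as ts").write.parquet("ts-micros.parquet")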
Added to list on #7958
Can I give a shot at this?
I think it would be worthwhile writing up a description of any proposed change first. It isn't necessarily clear to me how one handles this correctly, or even whether we couldn't simply follow Snowflake's example and close this as "won't fix". Perhaps @comphead might be able to help out here?
Thanks @edmondop, please hold off on this.
Describe the bug
DF reads the parquet timestamp datatype as nanoseconds, whereas DuckDB and Spark treat the same timestamp datatype as seconds.
To Reproduce
Create a parquet file with timestamp value -62125747200 and read it back.
DuckDB or Spark reads the value correctly,
but DF reads the timestamp as nanoseconds and provides the wrong answer.
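A self-contained sketch of the reproduction with pyarrow (file name assumed):

import pyarrow as pa
import pyarrow.parquet as pq

# -62125747200 seconds since the Unix epoch is 0001-04-25T00:00:00.
arr = pa.array([-62125747200], pa.int64()).cast(pa.timestamp("s"))
pq.write_table(pa.table({"a": arr}), "ts-repro.parquet")
# Note: parquet has no seconds unit, so pyarrow stores this as millis.

print(pq.read_table("ts-repro.parquet").column("a"))
# DuckDB/Spark show 0001-04-25 00:00:00 here; the bug is that DF
# interprets the stored integer as nanoseconds instead.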
Expected behavior
Behavior should be the same
Additional context
No response