Unable to read parquet TimestampLogicalTypeAnnotation that is not adjusted to UTC #3588
Comments
I'm not sure why |
Potentially related to #3367 - but still unsure why it works today w/ pyarrow parquet format, but not fastparquet format. |
Potentially related to which parquet format / implementation version is used. https://arrow.apache.org/docs/python/generated/pyarrow.parquet.write_table.html#pyarrow.parquet.write_table |
@devinrsmith I believe this is the same parquet file version issue:
Per https://arrow.apache.org/docs/python/generated/pyarrow.parquet.write_table.html#pyarrow.parquet.write_table docs, only parquet file version=2.6 supports nanosecond timestamps, here's the comment from the 'version' flag:
So, as a first check, I'd try switching the Parquet Java wrapper to version 2.6 and see whether that resolves it. |
My hunch is that DH cannot read parquet file version=2.6 (which might be the default for fastparquet?). pyarrow still defaults to version=2.4, and hence DH might be able to read those files in the Java wrapper. That is, try writing the pyarrow parquet files with version='2.6' |
Related to #976 ? |
The main issue here is that "isAdjustedToUTC=false" timestamps are not supported by our code because we don't support timezones, as stated in #976. Interestingly, files generated by both pyarrow and fastparquet have isAdjustedToUTC set to false. But our code crashes with an unclear exception in the fastparquet case, and in the pyarrow case it incorrectly assumes the timestamp is UTC. This assumption can lead to incorrect data being shown by Deephaven. For example, this is how pyarrow and Deephaven read the column f in the pyarrow.parquet file:
I think we can add a proper exception for non-UTC adjusted timestamp fields so that both the above cases fail consistently and properly. We can later add support for timezones as part of #976. |
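To make the proposal concrete, here is a rough sketch of the kind of check being suggested, written in Python for brevity rather than in the Java reader itself. Everything in it is hypothetical: `TimestampAnnotation`, `UnsupportedTimestampError`, and `check_timestamp_supported` are illustrative names, not part of Deephaven or any parquet library. The only point is that both the pyarrow and fastparquet files would fail the same way, with a message that names the column and the reason.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimestampAnnotation:
    """Hypothetical stand-in for a Parquet TIMESTAMP logical type annotation."""
    column: str
    unit: str                 # "MILLIS", "MICROS", or "NANOS"
    is_adjusted_to_utc: bool


class UnsupportedTimestampError(ValueError):
    """Hypothetical error for timestamp columns the reader cannot interpret."""


def check_timestamp_supported(annotation: TimestampAnnotation) -> None:
    # Reject local (not UTC-adjusted) timestamps up front instead of
    # crashing later or silently treating the values as UTC.
    if not annotation.is_adjusted_to_utc:
        raise UnsupportedTimestampError(
            f"Column '{annotation.column}': TIMESTAMP({annotation.unit}) with "
            "isAdjustedToUTC=false is not supported"
        )


# A UTC-adjusted column passes; a local-time column fails with a clear message.
check_timestamp_supported(TimestampAnnotation("ts_utc", "NANOS", True))
try:
    check_timestamp_supported(TimestampAnnotation("f", "NANOS", False))
except UnsupportedTimestampError as err:
    print(err)
```

Running the check once per column at schema-resolution time would keep the failure independent of which writer produced the file.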
In some situations, we are unable to read parquet files that have timezones (even if it's just reading into DateTime after applying offset).
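For illustration, "applying offset" could look roughly like the sketch below. It assumes the stored value is wall-clock nanoseconds (the meaning of isAdjustedToUTC=false in the Parquet format) and that the caller supplies a fixed UTC offset; `local_nanos_to_utc_nanos` is a hypothetical helper, not a Deephaven API, and a fixed offset does not account for daylight-saving transitions.

```python
from datetime import timedelta

NANOS_PER_SECOND = 1_000_000_000


def local_nanos_to_utc_nanos(local_nanos: int, utc_offset: timedelta) -> int:
    """Convert wall-clock nanos in a fixed-offset zone to UTC epoch nanos.

    Local time = UTC + offset, so UTC = local - offset.
    """
    # Exact integer arithmetic: whole microseconds in the offset, then to nanos.
    offset_nanos = (utc_offset // timedelta(microseconds=1)) * 1000
    return local_nanos - offset_nanos


# 2023-01-01T12:00:00 wall-clock time, written without a timezone.
local = 1_672_574_400 * NANOS_PER_SECOND

# Read as UTC-5, the same instant is 2023-01-01T17:00:00Z.
utc = local_nanos_to_utc_nanos(local, timedelta(hours=-5))
print(utc // NANOS_PER_SECOND)  # 1672592400
```

Treating `local` as if it were already UTC would be off by the full five hours here, which is the kind of silently incorrect value a reader should avoid.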
The general structure for generating these parquet files was a Python virtual env (pip install pandas pyarrow fastparquet) with
Deephaven is able to read pyarrow.parquet, but not fastparquet.parquet:
Pandas is able to successfully read both.
Here is some of the schema information: