
Unable to read parquet TimestampLogicalTypeAnnotation that is not adjusted to UTC #3588

Closed · devinrsmith opened this issue Mar 22, 2023 · 7 comments · Fixed by #4421
Labels: bug (Something isn't working), parquet (Related to the Parquet integration)
Milestone: Backlog

@devinrsmith (Member)

In some situations, we are unable to read parquet files whose timestamps carry time zone information (even if it's just reading them into DateTime after applying the offset).

The general structure for generating these parquet files was a Python virtual env (pip install pandas pyarrow fastparquet) with:

import pandas as pd
import numpy as np

df = pd.DataFrame(
    {
        "a": list("abc"),
        "b": list(range(1, 4)),
        "c": np.arange(3, 6).astype("u1"),
        "d": np.arange(4.0, 7.0, dtype="float64"),
        "e": [True, False, True],
        "f": pd.date_range("20130101", periods=3),
        "g": pd.date_range("20130101", periods=3, tz="US/Eastern"),
        "h": pd.Categorical(list("abc")),
        "i": pd.Categorical(list("abc"), ordered=True),
    }
)

df.to_parquet("pyarrow.parquet", engine='pyarrow', compression=None)
df.to_parquet("fastparquet.parquet", engine='fastparquet', compression=None)

Deephaven is able to read pyarrow.parquet, but not fastparquet.parquet:

io.deephaven.UncheckedDeephavenException: Unable to read column [f]: TimestampLogicalType, isAdjustedToUTC=false, unit=NANOS not supported
	at io.deephaven.parquet.table.ParquetSchemaReader.lambda$readParquetSchema$2(ParquetSchemaReader.java:275)
	at java.base/java.util.Optional.orElseThrow(Optional.java:408)
	at io.deephaven.parquet.table.ParquetSchemaReader.readParquetSchema(ParquetSchemaReader.java:269)
	at io.deephaven.parquet.table.ParquetTools.convertSchema(ParquetTools.java:647)
	at io.deephaven.parquet.table.ParquetTools.readTableInternal(ParquetTools.java:384)
	at io.deephaven.parquet.table.ParquetTools.readTable(ParquetTools.java:77)

Pandas is able to successfully read both.

Here is some of the schema information:

D SELECT * FROM PARQUET_SCHEMA('pyarrow.parquet');
┌─────────────────┬─────────┬────────────┬─────────────┬─────────────────┬──────────────┬──────────────────┬───────┬───────────┬──────────┬─────────────────────────────────────────────────────────────────────────────────────────────────────┐
│    file_name    │  name   │    type    │ type_length │ repetition_type │ num_children │  converted_type  │ scale │ precision │ field_id │                                            logical_type                                             │
│     varchar     │ varchar │  varchar   │   varchar   │     varchar     │    int64     │     varchar      │ int64 │   int64   │  int64   │                                               varchar                                               │
├─────────────────┼─────────┼────────────┼─────────────┼─────────────────┼──────────────┼──────────────────┼───────┼───────────┼──────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ pyarrow.parquet │ schema  │ BOOLEAN    │ 0           │ REQUIRED        │            9 │ UTF8             │     0 │         0 │        0 │                                                                                                     │
│ pyarrow.parquet │ a       │ BYTE_ARRAY │ 0           │ OPTIONAL        │            0 │ UTF8             │     0 │         0 │        0 │ StringType()                                                                                        │
│ pyarrow.parquet │ b       │ INT64      │ 0           │ OPTIONAL        │            0 │ UTF8             │     0 │         0 │        0 │                                                                                                     │
│ pyarrow.parquet │ c       │ INT32      │ 0           │ OPTIONAL        │            0 │ UINT_8           │     0 │         0 │        0 │ IntType(bitWidth, isSigned=0)                                                                      │
│ pyarrow.parquet │ d       │ DOUBLE     │ 0           │ OPTIONAL        │            0 │ UTF8             │     0 │         0 │        0 │                                                                                                     │
│ pyarrow.parquet │ e       │ BOOLEAN    │ 0           │ OPTIONAL        │            0 │ UTF8             │     0 │         0 │        0 │                                                                                                     │
│ pyarrow.parquet │ f       │ INT64      │ 0           │ OPTIONAL        │            0 │ TIMESTAMP_MICROS │     0 │         0 │        0 │ TimestampType(isAdjustedToUTC=0, unit=TimeUnit(MILLIS=<null>, MICROS=MicroSeconds(), NANOS=<null>)) │
│ pyarrow.parquet │ g       │ INT64      │ 0           │ OPTIONAL        │            0 │ TIMESTAMP_MICROS │     0 │         0 │        0 │ TimestampType(isAdjustedToUTC=1, unit=TimeUnit(MILLIS=<null>, MICROS=MicroSeconds(), NANOS=<null>)) │
│ pyarrow.parquet │ h       │ BYTE_ARRAY │ 0           │ OPTIONAL        │            0 │ UTF8             │     0 │         0 │        0 │ StringType()                                                                                        │
│ pyarrow.parquet │ i       │ BYTE_ARRAY │ 0           │ OPTIONAL        │            0 │ UTF8             │     0 │         0 │        0 │ StringType()                                                                                        │
├─────────────────┴─────────┴────────────┴─────────────┴─────────────────┴──────────────┴──────────────────┴───────┴───────────┴──────────┴─────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ 10 rows                                                                                                                                                                                                                            11 columns │
└───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
D SELECT * FROM PARQUET_SCHEMA('fastparquet.parquet');
┌─────────────────────┬─────────┬────────────┬─────────────┬─────────────────┬──────────────┬────────────────┬───────┬───────────┬──────────┬────────────────────────────────────────────────────────────────────────────────────────────────────┐
│      file_name      │  name   │    type    │ type_length │ repetition_type │ num_children │ converted_type │ scale │ precision │ field_id │                                            logical_type                                            │
│       varchar       │ varchar │  varchar   │   varchar   │     varchar     │    int64     │    varchar     │ int64 │   int64   │  int64   │                                              varchar                                               │
├─────────────────────┼─────────┼────────────┼─────────────┼─────────────────┼──────────────┼────────────────┼───────┼───────────┼──────────┼────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ fastparquet.parquet │ schema  │ BOOLEAN    │ 0           │ REQUIRED        │            9 │ UTF8           │     0 │         0 │        0 │                                                                                                    │
│ fastparquet.parquet │ a       │ BYTE_ARRAY │ 0           │ OPTIONAL        │            0 │ UTF8           │     0 │         0 │        0 │                                                                                                    │
│ fastparquet.parquet │ b       │ INT64      │ 64          │ OPTIONAL        │            0 │ UTF8           │     0 │         0 │        0 │                                                                                                    │
│ fastparquet.parquet │ c       │ INT32      │ 8           │ OPTIONAL        │            0 │ UINT_8         │     0 │         0 │        0 │                                                                                                    │
│ fastparquet.parquet │ d       │ DOUBLE     │ 64          │ OPTIONAL        │            0 │ UTF8           │     0 │         0 │        0 │                                                                                                    │
│ fastparquet.parquet │ e       │ BOOLEAN    │ 1           │ OPTIONAL        │            0 │ UTF8           │     0 │         0 │        0 │                                                                                                    │
│ fastparquet.parquet │ f       │ INT64      │ 0           │ OPTIONAL        │            0 │ UTF8           │     0 │         0 │        0 │ TimestampType(isAdjustedToUTC=0, unit=TimeUnit(MILLIS=<null>, MICROS=<null>, NANOS=NanoSeconds())) │
│ fastparquet.parquet │ g       │ INT64      │ 0           │ OPTIONAL        │            0 │ UTF8           │     0 │         0 │        0 │ TimestampType(isAdjustedToUTC=1, unit=TimeUnit(MILLIS=<null>, MICROS=<null>, NANOS=NanoSeconds())) │
│ fastparquet.parquet │ h       │ BYTE_ARRAY │ 0           │ OPTIONAL        │            0 │ UTF8           │     0 │         0 │        0 │                                                                                                    │
│ fastparquet.parquet │ i       │ BYTE_ARRAY │ 0           │ OPTIONAL        │            0 │ UTF8           │     0 │         0 │        0 │                                                                                                    │
├─────────────────────┴─────────┴────────────┴─────────────┴─────────────────┴──────────────┴────────────────┴───────┴───────────┴──────────┴────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ 10 rows                                                                                                                                                                                                                             11 columns │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
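One visible difference between the two schemas above: pyarrow stores columns f and g with unit=MICROS, while fastparquet stores unit=NANOS. Here is a minimal stdlib sketch (the epoch value is computed for illustration) of the truncation a MICROS writer performs on pandas' nanosecond timestamps:

```python
from datetime import datetime, timezone

# pandas holds datetime64[ns]; a writer targeting the older parquet format
# (pyarrow's default) truncates to MICROS, while fastparquet writes NANOS
# directly -- which is the unit Deephaven's error message complains about.
epoch_nanos = 1_356_998_400_000_000_000   # 2013-01-01T00:00:00Z in nanoseconds
epoch_micros = epoch_nanos // 1_000       # what a MICROS writer stores
dt = datetime.fromtimestamp(epoch_micros / 1_000_000, tz=timezone.utc)
print(dt.isoformat())  # 2013-01-01T00:00:00+00:00
```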
@devinrsmith devinrsmith added bug Something isn't working parquet Related to the Parquet integration labels Mar 22, 2023
@devinrsmith devinrsmith added this to the Backlog milestone Mar 22, 2023
@devinrsmith (Member, Author)

I'm not sure why pyarrow.parquet works even though the logical type is the same (maybe there are earlier code paths that make different decisions...).

@devinrsmith (Member, Author)

Potentially related to #3367, but still unsure why it works today with the pyarrow parquet format but not the fastparquet format.

@devinrsmith (Member, Author)

Potentially related to which parquet format / implementation version is used. https://arrow.apache.org/docs/python/generated/pyarrow.parquet.write_table.html#pyarrow.parquet.write_table

@kzk2000 commented Apr 4, 2023

@devinrsmith I believe this is the same parquet file version issue:

io.deephaven.UncheckedDeephavenException: Unable to read column [f]: TimestampLogicalType, isAdjustedToUTC=false, unit=NANOS not supported at io.deephaven.parquet.table.ParquetSchemaReader.lambda$readParquetSchema$2(ParquetSchemaReader.java:275)

Per the https://arrow.apache.org/docs/python/generated/pyarrow.parquet.write_table.html#pyarrow.parquet.write_table docs, only parquet file version 2.6 supports nanosecond timestamps. Here's the comment on the 'version' flag:

version{“1.0”, “2.4”, “2.6”}, default “2.4”
[...] Nanosecond timestamps are only available with version ‘2.6’. [...]

So, as a first check, I'd try switching the Parquet Java wrapper to version 2.6 and see if that resolves it.

@kzk2000 commented Apr 4, 2023

Deephaven is able to read pyarrow.parquet, but not fastparquet.parquet

My hunch is that DH cannot read parquet file version 2.6 (which might be the default for fastparquet?). pyarrow still defaults to version 2.4, and hence DH's Java wrapper might be able to read those files.

That is, try writing the pyarrow parquet files with version='2.6':
df.to_parquet("pyarrow.parquet", engine='pyarrow', compression=None, version='2.6')
My hunch is that in that case DH won't be able to read it either.

@devinrsmith (Member, Author)

Related to #976 ?

@malhotrashivam (Contributor) commented Aug 30, 2023

The main issue here is that "isAdjustedToUTC=false" timestamps are not supported by our code because we don't support time zones, as stated in #976. Interestingly, files generated by both pyarrow and fastparquet have isAdjustedToUTC set to false. But our code crashes with an unclear exception in the fastparquet case, while in the pyarrow case it incorrectly assumes the timestamp to be UTC.
The reason is that in the pyarrow case, the schema field ConvertedType is set. This takes a different code path where a bad assumption exists (as explained in #976).
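The divergence between the two code paths can be sketched as follows. This is purely illustrative (resolve_timestamp is a hypothetical stand-in, not Deephaven's actual code):

```python
# Illustrative sketch of the inconsistency described above; this is NOT
# Deephaven's actual code, and resolve_timestamp is a hypothetical helper.
def resolve_timestamp(converted_type, is_adjusted_to_utc):
    if converted_type is not None:
        # pyarrow files set ConvertedType (e.g. TIMESTAMP_MICROS): the legacy
        # path silently assumes UTC, which is wrong for wall-clock timestamps.
        return "assumed UTC"
    if not is_adjusted_to_utc:
        # fastparquet files carry only the logical type: this path rejects
        # isAdjustedToUTC=false with an exception.
        raise ValueError("isAdjustedToUTC=false not supported")
    return "UTC"

print(resolve_timestamp("TIMESTAMP_MICROS", False))  # assumed UTC (silently wrong)
```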

This assumption can lead to incorrect data being shown by Deephaven. For example, this is how pyarrow and Deephaven read column f of the pyarrow.parquet file:

    Pyarrow                Deephaven
 2013-01-01          2012-12-31T19:00:00.000
 2013-01-02          2013-01-01T19:00:00.000
 2013-01-03          2013-01-02T19:00:00.000
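The five-hour shift above is exactly what happens when a wall-clock (isAdjustedToUTC=false) value is misread as a UTC instant and then rendered in US/Eastern (UTC-5 in January); a minimal stdlib sketch:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# Column f stores wall-clock values (isAdjustedToUTC=false). If a reader
# wrongly treats the stored value as a UTC instant and then renders it in
# US/Eastern, every value shifts back five hours, matching the Deephaven
# column above.
naive = datetime(2013, 1, 1)                   # the value pyarrow shows
as_utc = naive.replace(tzinfo=timezone.utc)    # bad assumption: "it's UTC"
shown = as_utc.astimezone(ZoneInfo("America/New_York"))
print(shown.isoformat())  # 2012-12-31T19:00:00-05:00
```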

I think we can add a proper exception for non-UTC-adjusted timestamp fields so that both of the above cases fail consistently and clearly. We can later add support for time zones as part of #976.

7 participants