This repository has been archived by the owner on Feb 18, 2024. It is now read-only.

Cannot read parquet file with deeply nested structs/lists #1370

Open
ritchie46 opened this issue Jan 21, 2023 · 1 comment
Labels
bug Something isn't working

Comments

@ritchie46
Collaborator

This file is not readable with arrow2, but it is readable in pyarrow.

Error

thread '<unnamed>' panicked at 'invalid or out-of-range datetime', /home/ritchie46/.cargo/git/checkouts/arrow2-945af624853845da/8067ddc/src/temporal_conversions.rs:145:6

Contents

print(pl.read_parquet("part-00000-0c119376-450f-4aa4-8191-6a42b2fe4993-c000.snappy.parquet", use_pyarrow=True))

shape: (3, 31)
┌────────────┬───────┬───────┬────────────┬─────┬───────────┬────────────┬────────────┬────────────┐
│ resourceTy ┆ id    ┆ meta  ┆ implicitRu ┆ ... ┆ yy__versi ┆ yy__us_cor ┆ yy__us_cor ┆ yy__us_cor │
│ pe         ┆ ---   ┆ ---   ┆ les        ┆     ┆ on        ┆ e_race     ┆ e_ethnicit ┆ e_birthsex │
│ ---        ┆ str   ┆ struc ┆ ---        ┆     ┆ ---       ┆ ---        ┆ y          ┆ ---        │
│ str        ┆       ┆ t[8]  ┆ str        ┆     ┆ i32       ┆ str        ┆ ---        ┆ struct[1]  │
│            ┆       ┆       ┆            ┆     ┆           ┆            ┆ str        ┆            │
╞════════════╪═══════╪═══════╪════════════╪═════╪═══════════╪════════════╪════════════╪════════════╡
│ Patient    ┆ 1735a ┆ {null ┆ null       ┆ ... ┆ null      ┆ null       ┆ null       ┆ {null}     │
│            ┆ g9aa5 ┆ ,null ┆            ┆     ┆           ┆            ┆            ┆            │
│            ┆ 0f7b- ┆ ,null ┆            ┆     ┆           ┆            ┆            ┆            │
│            ┆ 72f6- ┆ ,2022 ┆            ┆     ┆           ┆            ┆            ┆            │
│            ┆ 502d- ┆ -04-0 ┆            ┆     ┆           ┆            ┆            ┆            │
│            ┆ b1ee- ┆ 4 02: ┆            ┆     ┆           ┆            ┆            ┆            │
│            ┆ 5e... ┆ 36... ┆            ┆     ┆           ┆            ┆            ┆            │
│ Patient    ┆ 17443 ┆ {null ┆ null       ┆ ... ┆ null      ┆ null       ┆ null       ┆ {null}     │
│            ┆ z2903 ┆ ,null ┆            ┆     ┆           ┆            ┆            ┆            │
│            ┆ -18b1 ┆ ,null ┆            ┆     ┆           ┆            ┆            ┆            │
│            ┆ a0f0- ┆ ,null ┆            ┆     ┆           ┆            ┆            ┆            │
│            ┆ 3895- ┆ ,null ┆            ┆     ┆           ┆            ┆            ┆            │
│            ┆ 4l0l- ┆ ,null ┆            ┆     ┆           ┆            ┆            ┆            │
│            ┆ 9d... ┆ ,n... ┆            ┆     ┆           ┆            ┆            ┆            │
│ Patient    ┆ 17443 ┆ {null ┆ null       ┆ ... ┆ null      ┆ null       ┆ null       ┆ {null}     │
│            ┆ a1663 ┆ ,null ┆            ┆     ┆           ┆            ┆            ┆            │
│            ┆ 7-18b ┆ ,"1", ┆            ┆     ┆           ┆            ┆            ┆            │
│            ┆ 16f7b ┆ 2020- ┆            ┆     ┆           ┆            ┆            ┆            │
│            ┆ -3435 ┆ 08-31 ┆            ┆     ┆           ┆            ┆            ┆            │
│            ┆ -4d0f ┆ 08:28 ┆            ┆     ┆           ┆            ┆            ┆            │
│            ┆ -9... ┆ :...  ┆            ┆     ┆           ┆            ┆            ┆            │
└────────────┴───────┴───────┴────────────┴─────┴───────────┴────────────┴────────────┴────────────┘

Schema

The column that cannot be read is the 'meta' column, containing this schema:

{'meta': Struct([Field('id': Utf8), Field('extension': List(Utf8)), Field('versionId': Utf8), Field('lastUpdated': Datetime(tu='ns', tz=None)), Field('source': Utf8), Field('profile': List(Utf8)), Field('security': List(Struct([Field('id': Utf8), Field('extension': List(Utf8)), Field('system': Utf8), Field('version': Utf8), Field('code': Utf8), Field('display': Utf8), Field('userSelected': Boolean)]))), Field('tag': List(Struct([Field('id': Utf8), Field('extension': List(Utf8)), Field('system': Utf8), Field('version': Utf8), Field('code': Utf8), Field('display': Utf8), Field('userSelected': Boolean)])))])}
@jorgecarleitao
Owner

This seems to be an issue with parsing a date in the file. Using pyarrow, could you share

  • the min and max of lastUpdated in the file?
  • the file with no data

I am trying to find out what the logical types are in the parquet file, to see whether this is an issue in converting them or a strange date in the data.
