read_parquet() returns empty list[i64] or explicitly crashes if use_pyarrow=True
#6428
Comments
What is the crash? Can you read the file with pandas?

The crash happens when using use_pyarrow=True. If I use pandas it crashes too; see below.
1/ with pyarrow
from pathlib import Path
import pandas as pd
path = Path("sample.pqt")
df = pd.read_parquet(path, engine="pyarrow")
print(df.head())
Traceback (most recent call last):
File "/home/olivier/GDrive/dev/golang/parquet-go-explo/test.py", line 20, in <module>
df = pd.read_parquet(path, engine="pyarrow")
File "/home/olivier/miniconda3/envs/test/lib/python3.10/site-packages/pandas/io/parquet.py", line 503, in read_parquet
return impl.read(
File "/home/olivier/miniconda3/envs/test/lib/python3.10/site-packages/pandas/io/parquet.py", line 251, in read
result = self.api.parquet.read_table(
File "/home/olivier/miniconda3/envs/test/lib/python3.10/site-packages/pyarrow/parquet/core.py", line 2871, in read_table
return dataset.read(columns=columns, use_threads=use_threads,
File "/home/olivier/miniconda3/envs/test/lib/python3.10/site-packages/pyarrow/parquet/core.py", line 2517, in read
table = self._dataset.to_table(
File "pyarrow/_dataset.pyx", line 332, in pyarrow._dataset.Dataset.to_table
File "pyarrow/_dataset.pyx", line 2661, in pyarrow._dataset.Scanner.to_table
File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status
OSError: Not yet implemented: DecodeArrow for DeltaLengthByteArrayDecoder.
2/ with fastparquet
from pathlib import Path
import pandas as pd
path = Path("sample.pqt")
df = pd.read_parquet(path, engine="fastparquet")
print(df.head())
Traceback (most recent call last):
File "/home/olivier/GDrive/dev/golang/parquet-go-explo/test.py", line 21, in <module>
df = pd.read_parquet(path, engine="fastparquet")
File "/home/olivier/miniconda3/envs/test/lib/python3.10/site-packages/pandas/io/parquet.py", line 503, in read_parquet
return impl.read(
File "/home/olivier/miniconda3/envs/test/lib/python3.10/site-packages/pandas/io/parquet.py", line 358, in read
return parquet_file.to_pandas(columns=columns, **kwargs)
File "/home/olivier/miniconda3/envs/test/lib/python3.10/site-packages/fastparquet/api.py", line 778, in to_pandas
self.read_row_group_file(rg, columns, categories, index,
File "/home/olivier/miniconda3/envs/test/lib/python3.10/site-packages/fastparquet/api.py", line 380, in read_row_group_file
core.read_row_group(
File "/home/olivier/miniconda3/envs/test/lib/python3.10/site-packages/fastparquet/core.py", line 621, in read_row_group
read_row_group_arrays(file, rg, columns, categories, schema_helper,
File "/home/olivier/miniconda3/envs/test/lib/python3.10/site-packages/fastparquet/core.py", line 591, in read_row_group_arrays
read_col(column, schema_helper, file, use_cat=name+'-catdef' in out,
File "/home/olivier/miniconda3/envs/test/lib/python3.10/site-packages/fastparquet/core.py", line 487, in read_col
num += read_data_page_v2(infile, schema_helper, se, ph.data_page_header_v2, cmd,
File "/home/olivier/miniconda3/envs/test/lib/python3.10/site-packages/fastparquet/core.py", line 225, in read_data_page_v2
raise NotImplementedError
NotImplementedError

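For context on the pyarrow error above: DELTA_LENGTH_BYTE_ARRAY is an encoding defined in the Parquet format specification, so the "Not yet implemented" message suggests a decoder missing from this pyarrow build rather than, by itself, a corrupt file. The sketch below illustrates only the layout idea of that encoding: all value lengths first, then the concatenated bytes. It is a simplification. The real encoding stores the lengths with DELTA_BINARY_PACKED, whereas here they stay a plain Python list, and the function names are hypothetical.

```python
from typing import List, Tuple


def encode_delta_length(values: List[bytes]) -> Tuple[List[int], bytes]:
    """Split byte strings into (lengths, concatenated data).

    Simplified: real Parquet writes the lengths with DELTA_BINARY_PACKED.
    """
    lengths = [len(v) for v in values]
    return lengths, b"".join(values)


def decode_delta_length(lengths: List[int], data: bytes) -> List[bytes]:
    """Rebuild the byte strings by slicing the data buffer length by length."""
    out = []
    offset = 0
    for n in lengths:
        out.append(data[offset:offset + n])
        offset += n
    if offset != len(data):
        raise ValueError("lengths do not cover the data buffer")
    return out


values = [b"Hello", b"World", b"Foobar", b"ABCDEF"]
lengths, data = encode_delta_length(values)
print(lengths)  # [5, 5, 6, 6]
print(data)     # b'HelloWorldFoobarABCDEF'
```

A reader that lacks this decoding step cannot materialize the column at all, which is consistent with an explicit error rather than silently wrong data.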
Then your parquet file is likely incorrect.

Well, that is what I thought too: the parquet file is corrupt or invalid in some way.
1/
from pathlib import Path
import pyarrow.parquet as pq
path = Path("sample2.pqt")
h = pq.ParquetFile(path)
print("----schema:")
print(h.schema)
print("----read:")
print(h.read())
----read:
pyarrow.Table
name: large_string
age: int64
sex: bool
weight: double
time: timestamp[us]
array: large_list<item: int64>
child 0, item: int64
----
name: [["Masterfog","Armspice"]]
age: [[22,23]]
sex: [[true,false]]
weight: [[51.2,65.3]]
time: [[2023-01-25 12:03:14.208962,2023-01-25 12:03:14.208962]]
array: [[[10,20],[11,22]]]
2/ In this example each lib can read its own nested list field but not that of the other. They return empty lists instead.
3/ So I think there must be subtle differences in how each one writes parquet.

I'll add that one of the many benefits of polars over pandas is the ability to hold lists and structs in cells. So this parquet issue is not negligible if you use Python/polars in hybrid (multi-language) data pipelines.

In fact, and contrary to what I wrote above, polars and parquet-go produce incompatible parquet formats, for nested fields only. However, the gap does not seem insurmountable. See parquet-go/issues/468 for the discussion. @ritchie46, any opinion on the subject?

This likely relates to https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#lists, and to readers such as parquet2 not handling the backwards compatibility rules correctly. For reference, the logic to handle this in parquet can be found here and here

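The backward-compatibility rules mentioned in the comment above can be sketched roughly as follows. This is one reading of the four element-resolution rules in the Lists section of LogicalTypes.md, not the code of parquet2, polars or parquet-go. SchemaNode and resolve_list_element are hypothetical names, and the schema is modeled with plain Python objects rather than a real Parquet library.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class SchemaNode:
    """Hypothetical stand-in for a Parquet schema field."""
    name: str
    repetition: str  # "required", "optional" or "repeated"
    children: List["SchemaNode"] = field(default_factory=list)

    @property
    def is_group(self) -> bool:
        return bool(self.children)


def resolve_list_element(list_group: SchemaNode) -> Tuple[SchemaNode, bool]:
    """Return (element node, elements_required) for a LIST-annotated group."""
    if len(list_group.children) != 1 or list_group.children[0].repetition != "repeated":
        raise ValueError("a LIST group must contain exactly one repeated field")
    repeated = list_group.children[0]
    # Rule 1: a repeated primitive is itself the element (two-level list).
    if not repeated.is_group:
        return repeated, True
    # Rule 2: a repeated group with several fields is itself the element.
    if len(repeated.children) > 1:
        return repeated, True
    # Rule 3: legacy names mark the repeated group as the element.
    if repeated.name in ("array", f"{list_group.name}_tuple"):
        return repeated, True
    # Rule 4: otherwise the repeated group is only a wrapper layer and
    # its single child is the element (standard three-level list).
    element = repeated.children[0]
    return element, element.repetition == "required"


three_level = SchemaNode("values", "optional", [
    SchemaNode("list", "repeated", [SchemaNode("element", "optional")]),
])
element, required = resolve_list_element(three_level)
print(element.name, required)  # element False
```

A reader that assumes only the three-level layout, or a writer that emits one of the older layouts, would disagree with the other side about where the element sits, which is one plausible way to end up with lists that look empty.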
After some trial and error, I found a way to read a polars-produced parquet file in Go, save it as parquet again, and load that file into a polars dataframe. See https://github.com/oscar6echo/parquet-polars-go. I'll copy my conclusion:

Polars version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of Polars.
Issue description
I am trying to read a sample parquet file produced by another language (golang).
It works fine except for the list[int64] column, which comes back empty although it is not empty on disk.
If I force use_pyarrow=True then it explicitly crashes.
NOTE: It seems somewhat connected to issue #6289, though this one is about reading parquet files while the other is about writing them.
1/ version 1:
2/ version 2:
Reproducible example
Expected behavior
The last column in the parquet file (list[int64]) should be returned by pl.read_parquet().
Repo oscar6echo/parquet-go-explo contains the code to produce this parquet file.
Installed versions