[Python] OSError: Couldn't deserialize thrift: TProtocolException #15153

MMCMA · 2023-01-02T15:03:04Z

Describe the bug, including details regarding any error messages, version, and platform.

I run a daily data process in Python 3.10 inside Docker. The same process creates around 500 Parquet files a day. However, about once a week (or 1 in 2500 files), a random file ends up corrupted, and reading it with pyarrow 10.0.1 fails with the following error:

OSError: Couldn't deserialize thrift: TProtocolException: Invalid data
Deserializing page header failed.

When I try to open the file with pandas (and engine='fastparquet') I get the following error:

  File "fastparquet\cencoding.pyx", line 336, in fastparquet.cencoding.NumpyIO.read
TypeError: an integer is required

I am not sure where to start or what the root cause could be.

Component(s)

Python

Versions:

INSTALLED VERSIONS

commit : 8dab54d6573f7186ff0c3b6364d5e4dd635ff3e7
python : 3.10.9.final.0
python-bits : 64
OS : Linux
OS-release : 5.11.0-40-generic
Version : #44~20.04.2-Ubuntu SMP Tue Oct 26 18:07:44 UTC 2021
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : C.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.5.2
numpy : 1.23.5
pytz : 2022.7
dateutil : 2.8.2
setuptools : 64.0.2
pip : 22.3.1
Cython : 0.29.32
pytest : 7.2.0
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.9.2
html5lib : 1.1
pymysql : None
psycopg2 : 2.9.5
jinja2 : 3.1.2
IPython : 8.7.0
pandas_datareader: 0.10.0
bs4 : 4.11.1
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.6.2
numba : 0.56.4
numexpr : None
odfpy : None
openpyxl : 3.0.10
pandas_gbq : None
pyarrow : 10.0.1
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.9.3
snappy : None
sqlalchemy : 1.4.45
tables : None
tabulate : 0.9.0
xarray : None
xlrd : None
xlwt : None
zstandard : None
tzdata : None


jorisvandenbossche · 2023-01-18T17:00:59Z

> I am not sure where to start and what could be the root cause.

Some questions that might help you get to a reproducible example, or point you in the right direction:

Can you identify the file for which it fails, and does it then fail reproducibly with this file? If you can identify the file, can you also trace it back to the data that was used to create it?
Can you share such a file?
How do you create the Parquet files? (Using pyarrow?)
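For the first question, a minimal sketch of how the failing files could be located, assuming they all sit in one directory and use a `.parquet` suffix. The function name `find_unreadable` and the `read_fn` parameter are hypothetical and only for illustration; by default the sketch does a full read with `pyarrow.parquet.read_table`, because a damaged page header may only surface when the data pages are decoded, not when the footer metadata alone is read.

```python
from pathlib import Path


def find_unreadable(directory, read_fn=None):
    """Try to read every *.parquet file in `directory`.

    Returns a dict mapping the path (as str) of each file that failed
    to the repr of the exception it raised.
    `read_fn` is a hypothetical hook: any callable that takes a path
    string and raises if the file cannot be read.
    """
    if read_fn is None:
        # Imported lazily so the helper can also be exercised with a
        # custom reader where pyarrow is not installed.
        import pyarrow.parquet as pq

        read_fn = pq.read_table

    failures = {}
    for path in sorted(Path(directory).glob("*.parquet")):
        try:
            # Full read, so that page headers are actually deserialized.
            read_fn(str(path))
        except Exception as exc:
            failures[str(path)] = repr(exc)
    return failures
```

Running this once per day right after the files are produced would narrow down which input data and which run created a bad file.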

MMCMA · 2023-01-19T10:08:40Z

I can close the issue - I just discovered by chance that, in very rare circumstances, two processes were writing to the same file at the same time. Sorry about this.
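For anyone who lands here with the same symptom: one common way to keep readers from ever seeing a half-written or interleaved file is to write to a unique temporary file in the same directory and then rename it into place. The following is a minimal sketch of that pattern, not what the reporter actually did; `write_atomically` and `write_fn` are hypothetical names. It relies on `os.replace`, which is atomic when source and destination are on the same filesystem. If two processes still target the same path, the last one to finish wins, but the file is always one complete output rather than a mix of both.

```python
import os
import tempfile
from pathlib import Path


def write_atomically(target, write_fn):
    """Write a file via a unique temp file, then rename it into place.

    `write_fn` is a hypothetical callback that receives the temp path
    (a str) and writes the full file there.
    """
    target = Path(target)
    # Same directory as the target, so the rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(
        dir=target.parent, prefix=target.name + ".", suffix=".tmp"
    )
    os.close(fd)  # write_fn reopens the path itself
    try:
        write_fn(tmp_name)
        os.replace(tmp_name, target)  # atomic swap into the final name
    except BaseException:
        # Leave no partial temp file behind; an existing target is untouched.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise
    return target
```

With pandas this could be called as `write_atomically("out.parquet", lambda p: df.to_parquet(p))`, where `df` is the DataFrame being saved.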

MMCMA added the Type: bug label Jan 2, 2023

rok added the Component: Python label Jan 8, 2023

jorisvandenbossche added the Component: Parquet label Jan 18, 2023

kou changed the title ~~OSError: Couldn't deserialize thrift: TProtocolException~~ [Python] OSError: Couldn't deserialize thrift: TProtocolException Jan 19, 2023

MMCMA closed this as completed Jan 19, 2023
