[Python] OSError: Couldn't deserialize thrift: TProtocolException #15153

Closed
MMCMA opened this issue Jan 2, 2023 · 2 comments

MMCMA commented Jan 2, 2023

Describe the bug, including details regarding any error messages, version, and platform.

I run a daily data process in Python 3.10 inside Docker. The same process creates around 500 Parquet files a day. However, about once a week (roughly 1 in 2,500 files) a random file ends up corrupted, and reading it with pyarrow 10.0.1 fails with the following error:

OSError: Couldn't deserialize thrift: TProtocolException: Invalid data
Deserializing page header failed.

When I try to open the file with pandas (and engine='fastparquet') I get the following error:

  File "fastparquet\cencoding.pyx", line 336, in fastparquet.cencoding.NumpyIO.read
TypeError: an integer is required

I am not sure where to start and what could be the root cause.

Component(s)

Python

Versions:

INSTALLED VERSIONS

commit : 8dab54d6573f7186ff0c3b6364d5e4dd635ff3e7
python : 3.10.9.final.0
python-bits : 64
OS : Linux
OS-release : 5.11.0-40-generic
Version : #44~20.04.2-Ubuntu SMP Tue Oct 26 18:07:44 UTC 2021
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : C.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.5.2
numpy : 1.23.5
pytz : 2022.7
dateutil : 2.8.2
setuptools : 64.0.2
pip : 22.3.1
Cython : 0.29.32
pytest : 7.2.0
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.9.2
html5lib : 1.1
pymysql : None
psycopg2 : 2.9.5
jinja2 : 3.1.2
IPython : 8.7.0
pandas_datareader: 0.10.0
bs4 : 4.11.1
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.6.2
numba : 0.56.4
numexpr : None
odfpy : None
openpyxl : 3.0.10
pandas_gbq : None
pyarrow : 10.0.1
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.9.3
snappy : None
sqlalchemy : 1.4.45
tables : None
tabulate : 0.9.0
xarray : None
xlrd : None
xlwt : None
zstandard : None
tzdata : None

jorisvandenbossche (Member) commented

> I am not sure where to start and what could be the root cause.

Some questions that might help you to get to a reproducible example, or might give some pointers of the direction to look for:

  • Can you identify the file for which it fails, and does it then fail reproducibly with this file? If you can identify the file, can you also trace it back to the data that was used to create this file?
  • Can you share such a file?
  • How do you create the parquet files? (using pyarrow?)
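A possible way to approach the first question is to scan the output directory and record every file that fails to read. The following is only a minimal sketch: `find_unreadable` and `read_with_pyarrow` are hypothetical helper names, the `*.parquet` pattern is an assumption about how the files are named, and the full read with `pyarrow.parquet.read_table` is used because it forces every page header to be decoded.

```python
from pathlib import Path


def read_with_pyarrow(path):
    # Imported lazily so the scan helper has no hard dependency on pyarrow.
    import pyarrow.parquet as pq

    # Reading the whole table forces all page headers to be deserialized.
    pq.read_table(str(path))


def find_unreadable(directory, read=read_with_pyarrow, pattern="*.parquet"):
    """Return a list of (path, error message) for files that fail to read."""
    failures = []
    for path in sorted(Path(directory).glob(pattern)):
        try:
            read(path)
        except Exception as exc:  # the corruption above surfaces as OSError
            failures.append((path, f"{type(exc).__name__}: {exc}"))
    return failures
```

Calling `find_unreadable("/path/to/output")` (a placeholder path) after each daily run would narrow the problem down to specific files, which can then be checked for reproducibility and traced back to their input data.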

kou changed the title from "OSError: Couldn't deserialize thrift: TProtocolException" to "[Python] OSError: Couldn't deserialize thrift: TProtocolException" on Jan 19, 2023

MMCMA commented Jan 19, 2023

I can close the issue. I just discovered by chance that, in very rare circumstances, two processes were writing to the same file at the same time. Sorry about this.
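The thread does not say how the concurrent writes were eventually prevented. One common way to guard against this class of corruption, sketched here under that assumption, is to have each writer produce its own uniquely named temporary file and then rename it into place. `write_atomically` and `write_fn` are hypothetical names introduced for illustration; `write_fn` stands for whatever call actually writes the Parquet file.

```python
import os
import tempfile


def write_atomically(final_path, write_fn):
    """Write via a unique temp file, then rename it into place."""
    directory = os.path.dirname(os.path.abspath(final_path))
    # mkstemp creates a unique file, so concurrent writers never share one.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    os.close(fd)
    try:
        write_fn(tmp_path)
        # os.replace is atomic when both paths are on the same filesystem.
        os.replace(tmp_path, final_path)
    except BaseException:
        # Leave no partial temp file behind if writing or renaming fails.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

With pandas this could be used as `write_atomically("out.parquet", lambda p: df.to_parquet(p))`, where `df` is a placeholder DataFrame. If two processes still target the same path, the last rename wins, but readers only ever see one complete file rather than interleaved bytes from both writers.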

MMCMA closed this as completed Jan 19, 2023