Describe the bug, including details regarding any error messages, version, and platform.
I run a daily data process in Python 3.10 inside Docker. It creates around 500 parquet files a day, all with the same code. However, about once a week (roughly 1 in 2500 files) a random file ends up corrupted, and reading it with pyarrow 10.0.1 fails with the following error log:
I am not sure where to start and what could be the root cause.
Some questions that might help you get to a reproducible example, or might give some pointers on where to look:
Can you identify the file for which it fails, and does it then fail reproducibly with this file? If you can identify the file, can you also trace it back to the data that was used to create it?
Can you share such a file?
How do you create the parquet files? (using pyarrow?)
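To help with the first question, here is a minimal sketch of a cheap way to scan a directory for suspect files without fully reading them. It relies only on the Parquet file layout (a 4-byte `PAR1` magic at the start, and a 4-byte little-endian footer length followed by `PAR1` at the end). The function names `looks_like_parquet` and `find_suspect_files` are made up for illustration, and the check only catches truncated or overwritten headers and footers, not every kind of corruption.

```python
import os
from pathlib import Path

PARQUET_MAGIC = b"PAR1"


def looks_like_parquet(path):
    """Cheap structural check: magic bytes at both ends and a plausible footer length."""
    size = os.path.getsize(path)
    # Smallest conceivable layout: header magic (4) + footer length (4) + trailing magic (4).
    if size < 12:
        return False
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-8, os.SEEK_END)
        tail = f.read(8)
    footer_len = int.from_bytes(tail[:4], "little")
    return (
        head == PARQUET_MAGIC
        and tail[4:] == PARQUET_MAGIC
        and footer_len <= size - 12  # the footer must fit inside the file
    )


def find_suspect_files(directory):
    """Return the *.parquet files in `directory` that fail the structural check."""
    return sorted(
        str(p) for p in Path(directory).glob("*.parquet") if not looks_like_parquet(p)
    )
```

A file that passes this check can still be damaged in the middle, so anything it flags is a starting point for tracing back to the input data rather than a full validation.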
I can close the issue. I just discovered by chance that, in very rare circumstances, two processes were writing to the same file at the same time. Sorry about this.
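For anyone who lands here with the same symptom, one common way to keep concurrent writers from interleaving their output is to write to a unique temporary file in the target directory and then rename it into place. The sketch below is not what the reporter used; `atomic_write` and its `write_fn` callback are hypothetical names, and it assumes the temporary file and the final path are on the same filesystem so that `os.replace` is atomic.

```python
import os
import tempfile


def atomic_write(final_path, write_fn):
    """Call write_fn(tmp_path) on a unique temp file, then rename it to final_path."""
    directory = os.path.dirname(os.path.abspath(final_path))
    # mkstemp gives every writer its own file, so two processes never share a handle.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    os.close(fd)
    try:
        write_fn(tmp_path)
        # The rename swaps the complete file in; readers see the old or the new
        # version, never a half-written mix.
        os.replace(tmp_path, final_path)
    except BaseException:
        # Clean up the temp file if writing or renaming failed.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

With pyarrow this could be called as `atomic_write("out.parquet", lambda tmp: pq.write_table(table, tmp))`. If two processes still target the same name, the last rename wins, but each resulting file is complete.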
When I try to open the file with pandas (and engine='fastparquet') I get the following error:
Component(s)
Python
Versions:
INSTALLED VERSIONS
commit : 8dab54d6573f7186ff0c3b6364d5e4dd635ff3e7
python : 3.10.9.final.0
python-bits : 64
OS : Linux
OS-release : 5.11.0-40-generic
Version : #44~20.04.2-Ubuntu SMP Tue Oct 26 18:07:44 UTC 2021
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : C.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.5.2
numpy : 1.23.5
pytz : 2022.7
dateutil : 2.8.2
setuptools : 64.0.2
pip : 22.3.1
Cython : 0.29.32
pytest : 7.2.0
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.9.2
html5lib : 1.1
pymysql : None
psycopg2 : 2.9.5
jinja2 : 3.1.2
IPython : 8.7.0
pandas_datareader: 0.10.0
bs4 : 4.11.1
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.6.2
numba : 0.56.4
numexpr : None
odfpy : None
openpyxl : 3.0.10
pandas_gbq : None
pyarrow : 10.0.1
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.9.3
snappy : None
sqlalchemy : 1.4.45
tables : None
tabulate : 0.9.0
xarray : None
xlrd : None
xlwt : None
zstandard : None
tzdata : None