Summary
After investigating a uWSGI worker crash, we found that a memory leak in our Parquet export was the cause.
We use Python 3.9 with pyarrow 17.0.0, but this is reproducible with the following code, and also with Python 3.12.
We found that the problem comes from writer.writer.write_table(table, row_group_size=1) (row_group_size=1 is for the example).
If you have any advice on how to reduce or fix this memory leak, it would be much appreciated 🙏
How to reproduce?
Install: pip install pyarrow memory_profiler
Then use the following script to generate 100,000 records, each an empty {"a": ""}. You will see that the RSS grows to about 90 MB for only 100,000 records before the writer is closed, and still stays roughly 75 MB higher after closing it!
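The script itself did not survive this capture; below is a minimal sketch of what such a repro looks like (the output path, loop structure, and memory_usage call are assumptions, not the original code):

```python
import pyarrow as pa
import pyarrow.parquet as pq
from memory_profiler import memory_usage

def export():
    schema = pa.schema([("a", pa.string())])
    # One record per table: {"a": ""}
    table = pa.table({"a": [""]}, schema=schema)
    with pq.ParquetWriter("/tmp/leak.parquet", schema) as writer:
        for _ in range(100_000):
            # Each call creates its own one-row row group.
            writer.write_table(table, row_group_size=1)

# Peak RSS (in MiB) over the export, as sampled by memory_profiler.
print(memory_usage((export, (), {}), max_usage=True))
```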
I'm having the same issue here with Python 3.11 and pyarrow 19.0.0. Is this related to, or another variant of, this issue, which seems old and has kind of stalled out? Or are they likely different?
It could be related to that issue, given the example I use here, but the two issues are not exactly the same.
Here it appears to be a problem with how the metadata are handled. Because row_group_size is set to 1, per-row-group metadata are collected and only written at the very end of the export, and with row_group_size=1 the collected metadata are huge. They also do not seem to be garbage-collected correctly at the end of the process, because memory usage does not drop back down once the export finishes.
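As a rough way to see this (a sketch, not from the original report), you can read the file's footer metadata back and check how much of it there is; the path assumes the repro script above:

```python
import pyarrow.parquet as pq

md = pq.read_metadata("/tmp/leak.parquet")
print(md.num_row_groups)   # 100000 row groups when written with row_group_size=1
print(md.serialized_size)  # size in bytes of the serialized footer metadata
```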
For "fixing" our issues, we set at least a row_group_size around 10 000 to reduce the memory usage and the number of collected metadata. It seems to be almost stable with such values.
Hence, also reducing the size of the footer. I don't have other advice to give.
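For example, a sketch of the workaround (the values are illustrative):

```python
import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.schema([("a", pa.string())])
table = pa.table({"a": [""] * 100_000}, schema=schema)

with pq.ParquetWriter("/tmp/ok.parquet", schema) as writer:
    # 10 row groups of 10,000 rows instead of 100,000 row groups of 1 row,
    # so far less per-row-group metadata accumulates for the footer.
    writer.write_table(table, row_group_size=10_000)
```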
Component(s)
Parquet, Python