BUG: to_json memory leak (introduced in 1.1.0) #43877
Comments
pls check master as well

persists on master

Hi, yes, I still see the same leak on master:

INSTALLED VERSIONS
commit : 6599834
pandas : 1.4.0.dev0+833.g6599834103
cc @WillAyd |
Thanks for the report. Any chance you’ve tried running this with Valgrind?
https://pandas.pydata.org/pandas-docs/stable/development/debugging_extensions.html
I did run Valgrind on this script with Python 3.9.5 and the current master version of pandas (commit
There are quite a few leaks reported, although I am not familiar with Valgrind, so I can't be sure what to conclude from the results.
At least one of the leaks appears on line 1763. Between that and line 1780 things look suspicious; that code can likely be refactored (pandas/pandas/_libs/src/ujson/python/objToJSON.c, line 1763 at d8440f1).

Line 1723 is also a culprit from what you've shared, though I don't think that was refactored any time around 1.1.0.

Thanks for running that, by the way!
yep, thanks for running it. FYI it's reproducible without using numpy when creating the dataframe:

```python
import pandas as pd

for _ in range(1000):
    df = pd.DataFrame({str(c): list(range(100_000)) for c in range(10)})
    df.to_json()
```
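As a side check (not from the thread): a `weakref` shows whether the DataFrame object itself survives the call, which would separate a Python-level reference problem from a leak held below the object layer. A sketch:

```python
import gc
import weakref
import pandas as pd

df = pd.DataFrame({str(c): list(range(100_000)) for c in range(10)})
frame_ref = weakref.ref(df)
df.to_json()
del df
gc.collect()
# If this prints True, the DataFrame itself was collected, so any
# remaining growth would come from references held at the C level.
print(frame_ref() is None)
```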
Cool, nice, even more minimal example. So yeah, I am 99% sure the problem is that the call to get_values on line 1763 is never released and is also duplicative of the call on line 1780. Either releasing the result of the 1763 call, or refactoring so get_values doesn't get called twice, should help.
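Not from the thread, but a rough way to look for that lingering reference from Python is to watch the refcount of the frame's underlying block array around the `to_json` call. This is a sketch only: `_mgr` is pandas-internal and its block layout differs across versions.

```python
import sys
import pandas as pd

df = pd.DataFrame({str(c): list(range(1_000)) for c in range(3)})
# Internal API: the array backing the frame's first block.
block_values = df._mgr.blocks[0].values
before = sys.getrefcount(block_values)
df.to_json()
after = sys.getrefcount(block_values)
# If to_json retained an un-released view of the block values, `after`
# would stay above `before`; otherwise the two counts match.
print(before, after)
```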
Hi @WillAyd, thanks!
- [x] I have checked that this issue has not already been reported.
- [x] I have confirmed this bug exists on the latest version of pandas.
- [x] I have confirmed this bug exists on the master branch of pandas.
Reproducible Example
Issue Description
It looks like a memory leak when calling `to_json`, introduced in version 1.1.0. It seems to prevent the dataframe from being correctly garbage collected. Here's a memory profile of Pandas 1.1.0 compared to the previous version 1.0.5. I see the same trends on both Windows 10 and Linux Ubuntu, with Python 3.7, 3.8 and 3.9.

This leak is still there on the latest Pandas version 1.3.3 and is proportional to the size of the dataframe. I've tried direct calls to `del` and `gc.collect()`, but they don't change anything. It's specific to the `to_json` method; I haven't observed a leak with other formats such as CSV. I don't know if it makes sense or helps, but here's an output using tracemalloc from this code:
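(The tracemalloc snippet referenced here was not captured in this report. A comparable measurement, as a hedged reconstruction rather than the author's exact code, might look like:)

```python
import gc
import tracemalloc
import pandas as pd

tracemalloc.start()
for _ in range(5):
    df = pd.DataFrame({str(c): list(range(100_000)) for c in range(10)})
    df.to_json()
    del df
    gc.collect()

# Report the top allocation sites still holding memory after the loop.
snapshot = tracemalloc.take_snapshot()
for stat in snapshot.statistics("lineno")[:5]:
    print(stat)
tracemalloc.stop()
```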
with Pandas 1.1.0 or 1.3.3:
whereas 1.0.5 produces this:
Expected Behavior
No leak expected, similar to version 1.0.5
Installed Versions
Versions with leak:
master -----------------
commit : 6599834
python : 3.8.5.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19041
machine : AMD64
processor : Intel64 Family 6 Model 158 Stepping 13, GenuineIntel
pandas : 1.4.0.dev0+833.g6599834103
numpy : 1.21.2
pytz : 2021.1
dateutil : 2.8.2
...
1.3.3 ------------------
commit : 73c6825
python : 3.7.6.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19041
machine : AMD64
processor : Intel64 Family 6 Model 158 Stepping 13, GenuineIntel
pandas : 1.3.3
numpy : 1.21.2
...
1.1.0 ------------------
commit : d9fff27
pandas : 1.1.0
numpy : 1.21.2
...
Versions without leak:
commit : None
pandas : 1.0.5
numpy : 1.21.2
...