BUG: to_json memory leak (introduced in 1.1.0) #43877
Comments
pls check master as well

persists on master

Hi, yes, I still see the same leak on master:

INSTALLED VERSIONS
commit : 6599834
pandas : 1.4.0.dev0+833.g6599834103
cc @WillAyd |
Thanks for the report. Any chance you’ve tried running this with Valgrind?
https://pandas.pydata.org/pandas-docs/stable/development/debugging_extensions.html
I did run Valgrind on this script with Python 3.9.5 and the current master version of pandas (commit
There are quite a few leaks reported, although I am not familiar with Valgrind, so I can't be sure what to conclude from the results.
At least one of the leaks appears on line 1763. Between that and line 1780 things look suspicious; that code can likely be refactored (pandas/pandas/_libs/src/ujson/python/objToJSON.c, line 1763 at d8440f1).

Line 1723 is also a culprit from what you've shared, though I don't think that was refactored any time around 1.1.0.

Thanks for running that, by the way!
yep, thanks for running it. FYI it's reproducible without using numpy when creating the dataframe:

```python
import pandas as pd

for _ in range(1000):
    df = pd.DataFrame({str(c): list(range(100_000)) for c in range(10)})
    df.to_json()
```
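As a side check (not from the thread): a `weakref` shows whether the DataFrame object itself survives the call, which would separate a Python-level reference problem from a leak held below the object layer. A sketch:

```python
import gc
import weakref
import pandas as pd

df = pd.DataFrame({str(c): list(range(100_000)) for c in range(10)})
frame_ref = weakref.ref(df)
df.to_json()
del df
gc.collect()
# If this prints True, the DataFrame itself was collected, so any
# remaining growth would come from references held at the C level.
print(frame_ref() is None)
```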
Cool, nice, even more minimal example. So yeah, I am 99% sure the problem is that the call to get_values on line 1763 is never released and is also duplicative of the call on line 1780. Either releasing the result of the 1763 call, or refactoring so get_values doesn't get called twice, should help.
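Not from the thread, but a rough way to look for that lingering reference from Python is to watch the refcount of the frame's underlying block array around the `to_json` call. This is a sketch only: `_mgr` is pandas-internal and its block layout differs across versions.

```python
import sys
import pandas as pd

df = pd.DataFrame({str(c): list(range(1_000)) for c in range(3)})
# Internal API: the array backing the frame's first block.
block_values = df._mgr.blocks[0].values
before = sys.getrefcount(block_values)
df.to_json()
after = sys.getrefcount(block_values)
# If to_json retained an un-released view of the block values, `after`
# would stay above `before`; otherwise the two counts match.
print(before, after)
```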
Hi @WillAyd, thanks!
- [x] I have checked that this issue has not already been reported.
- [x] I have confirmed this bug exists on the latest version of pandas.
- [x] I have confirmed this bug exists on the master branch of pandas.
Reproducible Example
Issue Description
It looks like a memory leak when calling `to_json`, introduced in version 1.1.0. It seems to prevent the dataframe from being correctly garbage collected. Here's a memory profile of Pandas 1.1.0 compared to the previous version 1.0.5. I see the same trends on both Windows 10 and Linux Ubuntu, with Python 3.7, 3.8 and 3.9.

This leak is still there on the latest Pandas version 1.3.3 and is proportional to the size of the dataframe. I've tried direct calls to `del` and `gc.collect()`, but they don't change anything. It's specific to the `to_json` method; I haven't observed a leak with other formats such as CSV. I don't know if it makes sense or helps, but here's an output using tracemalloc from this code:
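(The tracemalloc snippet referenced here was not captured in this report. A comparable measurement, as a hedged reconstruction rather than the author's exact code, might look like:)

```python
import gc
import tracemalloc
import pandas as pd

tracemalloc.start()
for _ in range(5):
    df = pd.DataFrame({str(c): list(range(100_000)) for c in range(10)})
    df.to_json()
    del df
    gc.collect()

# Report the top allocation sites still holding memory after the loop.
snapshot = tracemalloc.take_snapshot()
for stat in snapshot.statistics("lineno")[:5]:
    print(stat)
tracemalloc.stop()
```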
with Pandas 1.1.0 or 1.3.3:
whereas 1.0.5 produces this:
Expected Behavior
No leak expected, similar to version 1.0.5
Installed Versions
Versions with leak:
master -----------------
commit : 6599834
python : 3.8.5.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19041
machine : AMD64
processor : Intel64 Family 6 Model 158 Stepping 13, GenuineIntel
pandas : 1.4.0.dev0+833.g6599834103
numpy : 1.21.2
pytz : 2021.1
dateutil : 2.8.2
...
1.3.3 ------------------
commit : 73c6825
python : 3.7.6.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19041
machine : AMD64
processor : Intel64 Family 6 Model 158 Stepping 13, GenuineIntel
pandas : 1.3.3
numpy : 1.21.2
...
1.1.0 ------------------
commit : d9fff27
pandas : 1.1.0
numpy : 1.21.2
...
Versions without leak:
commit : None
pandas : 1.0.5
numpy : 1.21.2
...