Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Memory leak in yt's test suite (Python 3.12.0rc1 to 3.12.0rc3) #109602

Closed
neutrinoceros opened this issue Sep 20, 2023 · 20 comments
Closed

Memory leak in yt's test suite (Python 3.12.0rc1 to 3.12.0rc3) #109602

neutrinoceros opened this issue Sep 20, 2023 · 20 comments
Labels
3.12 bugs and security fixes 3.13 bugs and security fixes performance Performance or resource usage type-bug An unexpected behavior, bug, or error

Comments

@neutrinoceros
Copy link
Contributor

neutrinoceros commented Sep 20, 2023

Bug report

Bug description:

yt's test suite started crashing when switched to Python 3.12. rc1 was the earliest that we tried, and we're still seeing this with rc3. When run through GitHub Actions, jobs simply exit with Error: the operation was cancelled, and we get no traceback.

It appeared that this crash can only be reproduced on machines with little RAM available, so we tried profiling memory usage.
Here's an illustrative graph, provided by @yut23, of memory usage VS time for a sample of yt's test suite (all third-party packages in the environment are at the same version in both envs)
memory_usage_comparison

Here's an self-contained reproducer script, using memray to trace memory usage

set -euxo pipefail

git clone https://github.com/yt-project/yt --depth=1

python3.12 -m venv .venv
source .venv/bin/activate
python -m pip install -e "./yt[test]"
python -m pip install memray

# setuptools is inspected at runtime in yt's pytest configuration, but it's not needed otherwise
python -m pip install setuptools

cd yt/
python -m memray run -m pytest yt/data_objects

yt uses weakref quite a lot, so we suspect that the issue here might be reminiscent of #108295 (this was pointed out by @Xarthisius).

CPython versions tested on:

3.12

Operating systems tested on:

Linux, macOS

@mdboom
Copy link
Contributor

mdboom commented Sep 20, 2023

I attempted to bisect where this is coming from, but it's challenging since so many of the dependencies break along the 3.11-to-3.12 path.

A standalone, simple reproducer would be very helpful here. Perhaps the memray data offers some clues?

@Xarthisius
Copy link

I attempted to bisect where this is coming from, but it's challenging since so many of the dependencies break along the 3.11-to-3.12 path.

A standalone, simple reproducer would be very helpful here. Perhaps the memray data offers some clues?

I'm getting there. A simpler case, but still relying on yt:

import numpy as np
from yt import load_uniform_grid


def foo():
    arr = np.ones((16, 16 ,16))
    data = dict(density=arr)
    bbox = np.array([[-1, 1], [-1, 1], [-1, 1]])
    ds = load_uniform_grid(data, arr.shape, 1.0, bbox=bbox)

    data_source = ds.all_data()
    field = "density"
    min_val, max_val = data_source[field].min() / 2, data_source[field].max() / 2
    return min_val, max_val

for i in range(1000):
    foo()

@mdboom
Copy link
Contributor

mdboom commented Sep 20, 2023

Thanks, this is really helpful.

@neutrinoceros
Copy link
Contributor Author

All I'm actually able to get from memray is that numpy arrays are the thing we're allocating. So far so good, but what I'd like to know is where memory should be freed but isn't, and I'm not sure that memray can help with that.

@mdboom
Copy link
Contributor

mdboom commented Sep 20, 2023

FWIW if you are using a Python from one of the python.org macOS installers or from MacPorts, you shouldn't encounter this as they include a private copy of ncurses.

Was this intended for another issue? I'm having trouble connecting the dots (as I don't think ncurses is involved in this bug).

@ned-deily
Copy link
Member

Oops, so that's where that comment went! Sorry!

@neutrinoceros
Copy link
Contributor Author

Update: I've spent a couple hours trying to reduce the reproducer and trying to slowly remove yt from it (easier said than done !)

I discovered that commenting this single line in yt made the issue disappear
https://github.com/yt-project/yt/blob/95e70702d35460ff39e1341fcfe191bb139af5ef/yt/data_objects/static_output.py#L695

So we're not there yet, but that felt like a hint worth sharing.

@mdboom
Copy link
Contributor

mdboom commented Sep 22, 2023

Let me share my findings as well. I have a script that prints out "unreachable garbage" at the end of the run (objects where the reference count is larger than len(gc.get_referrers(obj)), indicating some sort of reference counting bug). The only unreachable garbage on 3.12 that's not on 3.11 is a bunch of traceback objects (which are hanging on to a bunch of other objects, hence consuming a lot of memory). So that concurs with your finding, @neutrinoceros, since load_all_plugins() internally raises and catches a lot of exceptions in its normal course of business.

Trying to turn that into a pure Python reproducer has also alluded me: simply raising and catching a bunch of exceptions isn't enough, but is probably somehow related to object cycles and/or weak references getting involved as well. I think a pure Python reproducer will unlock the answer, because then we can git bisect to find the breakage -- as it stands now, I can't get numpy/yt and the various other dependencies to build across the range of 3.12a1 to now.

@AA-Turner AA-Turner added performance Performance or resource usage 3.12 bugs and security fixes 3.13 bugs and security fixes labels Sep 22, 2023
@neutrinoceros
Copy link
Contributor Author

I am getting close to a yt-less reproducer, though for now I am not trying to eliminate numpy from it. Of course having a pure Python reproducer would be even better, but since numpy is a self-contained dependency, I'm wondering if that would be enough. Thoughts @mdboom ?

@neutrinoceros
Copy link
Contributor Author

neutrinoceros commented Sep 24, 2023

Here's my reduced reproducer https://github.com/neutrinoceros/reprod_cpython_bug_109602
(I'm tagging the current version no-yt in case I can refine it further later)

In it's current state, it still depends on numpy at runtime and on Cython + setuptools at build time, but yt is out, and we're left with ~100 lines of Python and 3 lines of Cython.

I have faith that the remaining dependency on numpy can be eliminated (I think its only effect is to generate big objects), but I wanted to share my progress as removing yt was quite an undertaking.

@neutrinoceros
Copy link
Contributor Author

neutrinoceros commented Sep 24, 2023

I just pushed a new tag no-numpy. The Cython-generated C extension still seems crucial to reproduce, but I'd like to point out that the only reason why I'm using Cython is that I don't know (yet) how to write the minimal C code myself. I also don't know how compatible Cython is with the 3.12 branch. Please let me know if it's still a blocker to the bisection.

@mdboom
Copy link
Contributor

mdboom commented Sep 25, 2023

Thanks a lot, @neutrinoceros. I'll have a look today.

@mdboom
Copy link
Contributor

mdboom commented Sep 25, 2023

I think I have found the bug in Cython's new support for 3.12. It will take me a while to put together a proper bug report -- I just wanted to report out that I think I found it so you don't waste too much time further refining.

@yut23
Copy link

yut23 commented Sep 25, 2023

I finally isolated this to a reference leak in Cython's exception handling for 3.12, added in cython/cython#5442. It occurs when Cython adds a traceback frame to an exception that was raised in Python code. New minimal reproducer, using the Cython module in https://github.com/neutrinoceros/reprod_cpython_bug_109602:

from collections import UserDict
from mini_yt.lib import bar

def main():
    d = UserDict()
    d["lots of data"] = bytearray(1024**2)
    try:
        # tries to access d["spam"]
        bar(d)
    except KeyError:
        pass

NLOOPS = 40
for i in range(1, NLOOPS + 1):
    main()
    print(f"{i}/{NLOOPS}", end="\r")

@mdboom
Copy link
Contributor

mdboom commented Sep 25, 2023

Yep, that's what I found, too. I'm in the process of writing up a Cython bug and fix. Would you prefer to take it, @yut23?

@mdboom
Copy link
Contributor

mdboom commented Sep 25, 2023

I believe this is the fix [EDITED]:

diff --git a/Cython/Utility/Exceptions.c b/Cython/Utility/Exceptions.c
index 20dc7faf3..ba3bd325f 100644
--- a/Cython/Utility/Exceptions.c
+++ b/Cython/Utility/Exceptions.c
@@ -182,10 +182,12 @@ static CYTHON_INLINE void __Pyx_ErrRestoreInState(PyThreadState *tstate, PyObjec
     CYTHON_UNUSED_VAR(type);
     if (value) {
         #if CYTHON_COMPILING_IN_CPYTHON
-        if (unlikely(((PyBaseExceptionObject*) value)->traceback != tb))
-        #endif
+        if (unlikely(((PyBaseExceptionObject*) value)->traceback != tb)) {
             // If this fails, we may lose the traceback but still set the expected exception below.
             PyException_SetTraceback(value, tb);
+            Py_XDECREF(tb);
+        }
+        #endif
     }
     tmp_value = tstate->current_exception;
     tstate->current_exception = value;

@yut23
Copy link

yut23 commented Sep 25, 2023

Yeah, I'll take care of it later this morning. I think it also needs an extra preprocessor guard and a decref for type, if I'm following the code correctly.

@mdboom
Copy link
Contributor

mdboom commented Sep 25, 2023

Thanks all for helping to get to the bottom of this.

@mdboom
Copy link
Contributor

mdboom commented Sep 25, 2023

Closing this, because it seems to not be a bug in CPython, though feel free to reopen if there's any documentation that should be improved, etc.

@mdboom mdboom closed this as completed Sep 25, 2023
@neutrinoceros
Copy link
Contributor Author

Thanks a lot for your support !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3.12 bugs and security fixes 3.13 bugs and security fixes performance Performance or resource usage type-bug An unexpected behavior, bug, or error
Projects
None yet
Development

No branches or pull requests

6 participants