gh-128942: make arraymodule.c free-thread safe (lock-free) #130771

Open · wants to merge 37 commits into main

Conversation

@tom-pytel (Contributor) commented Mar 2, 2025:

I added lock-free single-element reads and writes, mostly by copying the list object's homework. TL;DR: pyperformance scimark seems to be back to about what it was without the free-thread-safe changes (pending confirmation, of course). I tried a few other things, but the list strategy seems good enough (except for the negative-index issue I mentioned in #130744, if that is in fact an issue).
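
For context, here is a minimal sketch of the list-style lock-free read pattern, specialized to a 'd' (double) array. This is illustrative only and not the exact code in the PR; it only shows the shape of the approach:

static PyObject *
d_getitem_lock_free(arrayobject *self, Py_ssize_t i)
{
    /* Atomic size load so the bounds check is consistent with a
       concurrent append/resize. */
    Py_ssize_t size = _Py_atomic_load_ssize(&((PyVarObject *)self)->ob_size);
    if (i < 0 || i >= size) {
        PyErr_SetString(PyExc_IndexError, "array index out of range");
        return NULL;
    }
    /* Atomic load of the item buffer; a racing resize swaps this pointer
       but frees the old block through QSBR delayed free, so a reader that
       loses the race still reads valid (possibly stale) memory. */
    double *items = _Py_atomic_load_ptr(&self->ob_item);
    return PyFloat_FromDouble(items[i]);
}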

Timings: the relevant rows are "OLD" (the non-free-thread-safe arraymodule), "SLOW" (the previous, slower PR), and the two "LFREERW" rows.

### scimark_fft ###
WITHGIL: Mean +- std dev: 200 ms +- 6 ms
OLD:     Mean +- std dev: 238 ms +- 7 ms
SLOW:    Mean +- std dev: 274 ms +- 4 ms
ITEMFIX: Mean +- std dev: 272 ms +- 14 ms
NOLOCK:  Mean +- std dev: 240 ms +- 4 ms
TEST1:   Mean +- std dev: 267 ms +- 4 ms
TEST2:   Mean +- std dev: 255 ms +- 6 ms
LFREERD: Mean +- std dev: 256 ms +- 7 ms   # left in standard Py_SIZE() in getarrayitem
LFREERD: Mean +- std dev: 262 ms +- 7 ms   # atomic read of size in getarrayitem
LFREERD: Mean +- std dev: 270 ms +- 7 ms   # build and run again
LFREERD: Mean +- std dev: 255 ms +- 10 ms  # orthodox
LFREERD: Mean +- std dev: 259 ms +- 8 ms   # build and run again
LFREERW: Mean +- std dev: 239 ms +- 7 ms   # lockfree write single element
LFREERW: Mean +- std dev: 242 ms +- 6 ms   # build and run again
STATICL: Mean +- std dev: 223 ms +- 6 ms   # statically linked
STATICL: Mean +- std dev: 223 ms +- 5 ms   # build and run again
CLEANUP: Mean +- std dev: 224 ms +- 5 ms   # including atomic get/set
AMEMCPY: Mean +- std dev: 226 ms +- 6 ms   # atomic item copy on resize (instead of memcpy)
AMEMCPY: Mean +- std dev: 222 ms +- 5 ms   # build and run again
LOCKWR:  Mean +- std dev: 226 ms +- 4 ms   # locked writes again, but much better with static linking
LOCKWR:  Mean +- std dev: 226 ms +- 5 ms   # build and run again
AMCAGG:  Mean +- std dev: 222 ms +- 7 ms   # aggregate memcpy (just using best build)

### scimark_lu ###
WITHGIL: Mean +- std dev: 64.6 ms +- 1.9 ms
OLD:     Mean +- std dev: 89.0 ms +- 2.1 ms
SLOW:    Mean +- std dev: 95.4 ms +- 3.1 ms
ITEMFIX: Mean +- std dev: 92.0 ms +- 2.1 ms
NOLOCK:  Mean +- std dev: 88.5 ms +- 2.5 ms
TEST1:   Mean +- std dev: 89.7 ms +- 2.1 ms
TEST2:   Mean +- std dev: 90.5 ms +- 2.4 ms
LFREERD: Mean +- std dev: 89.9 ms +- 2.7 ms
LFREERD: Mean +- std dev: 90.9 ms +- 2.6 ms
LFREERD: Mean +- std dev: 93.4 ms +- 4.0 ms
LFREERD: Mean +- std dev: 91.9 ms +- 3.7 ms
LFREERD: Mean +- std dev: 91.0 ms +- 5.0 ms
LFREERW: Mean +- std dev: 89.2 ms +- 2.8 ms
LFREERW: Mean +- std dev: 88.8 ms +- 2.5 ms
STATICL: Mean +- std dev: 86.9 ms +- 1.9 ms
STATICL: Mean +- std dev: 87.1 ms +- 2.2 ms
CLEANUP: Mean +- std dev: 88.9 ms +- 2.5 ms
AMEMCPY: Mean +- std dev: 87.9 ms +- 1.9 ms
AMEMCPY: Mean +- std dev: 88.7 ms +- 4.8 ms
LOCKWR:  Mean +- std dev: 88.0 ms +- 2.4 ms
LOCKWR:  Mean +- std dev: 88.5 ms +- 3.5 ms
AMCAGG:  Mean +- std dev: 88.5 ms +- 3.5 ms

### scimark_monte_carlo ###
WITHGIL: Mean +- std dev: 40.1 ms +- 2.3 ms
OLD:     Mean +- std dev: 53.4 ms +- 1.0 ms
SLOW:    Mean +- std dev: 56.8 ms +- 1.4 ms
ITEMFIX: Mean +- std dev: 55.9 ms +- 0.7 ms
NOLOCK:  Mean +- std dev: 54.4 ms +- 1.3 ms
TEST1:   Mean +- std dev: 54.1 ms +- 0.6 ms
TEST2:   Mean +- std dev: 54.1 ms +- 0.6 ms
LFREERD: Mean +- std dev: 54.3 ms +- 0.9 ms
LFREERD: Mean +- std dev: 55.9 ms +- 0.9 ms
LFREERD: Mean +- std dev: 55.4 ms +- 1.1 ms
LFREERD: Mean +- std dev: 55.4 ms +- 1.1 ms
LFREERD: Mean +- std dev: 55.2 ms +- 1.5 ms
LFREERW: Mean +- std dev: 53.2 ms +- 1.1 ms
LFREERW: Mean +- std dev: 53.9 ms +- 1.5 ms
STATICL: Mean +- std dev: 55.0 ms +- 1.9 ms
STATICL: Mean +- std dev: 53.3 ms +- 1.0 ms
CLEANUP: Mean +- std dev: 52.6 ms +- 0.5 ms
AMEMCPY: Mean +- std dev: 53.3 ms +- 0.6 ms
AMEMCPY: Mean +- std dev: 53.4 ms +- 1.2 ms
LOCKWR:  Mean +- std dev: 55.4 ms +- 4.0 ms
LOCKWR:  Mean +- std dev: 54.8 ms +- 1.8 ms
AMCAGG:  Mean +- std dev: 53.1 ms +- 1.0 ms

### scimark_sor ###
WITHGIL: Mean +- std dev: 70.3 ms +- 3.3 ms
OLD:     Mean +- std dev: 98.1 ms +- 1.8 ms
SLOW:    Mean +- std dev: 99.9 ms +- 1.1 ms
ITEMFIX: Mean +- std dev:  100 ms +- 1 ms
NOLOCK:  Mean +- std dev: 97.3 ms +- 1.4 ms
TEST1:   Mean +- std dev: 97.5 ms +- 0.7 ms
TEST2:   Mean +- std dev: 98.5 ms +- 1.5 ms
LFREERD: Mean +- std dev: 97.0 ms +- 2.0 ms
LFREERD: Mean +- std dev: 98.2 ms +- 1.4 ms
LFREERD: Mean +- std dev: 98.4 ms +- 2.7 ms
LFREERD: Mean +- std dev: 97.6 ms +- 1.5 ms
LFREERD: Mean +- std dev: 97.9 ms +- 1.8 ms
LFREERW: Mean +- std dev: 97.2 ms +- 1.6 ms
LFREERW: Mean +- std dev: 96.9 ms +- 2.0 ms
STATICL: Mean +- std dev: 97.8 ms +- 1.9 ms
STATICL: Mean +- std dev: 97.2 ms +- 1.7 ms
CLEANUP: Mean +- std dev: 97.7 ms +- 1.2 ms
AMEMCPY: Mean +- std dev: 96.0 ms +- 2.0 ms
AMEMCPY: Mean +- std dev: 96.4 ms +- 1.5 ms
LOCKWR:  Mean +- std dev: 97.7 ms +- 1.5 ms
LOCKWR:  Mean +- std dev: 96.8 ms +- 2.5 ms
AMCAGG:  Mean +- std dev: 95.6 ms +- 0.5 ms

### scimark_sparse_mat_mult ###
WITHGIL: Mean +- std dev: 2.89 ms +- 0.11 ms
OLD:     Mean +- std dev: 3.94 ms +- 0.15 ms
SLOW:    Mean +- std dev: 4.68 ms +- 0.07 ms
ITEMFIX: Mean +- std dev: 4.34 ms +- 0.07 ms
NOLOCK:  Mean +- std dev: 3.94 ms +- 0.58 ms
TEST1:   Mean +- std dev: 3.99 ms +- 0.12 ms
TEST2:   Mean +- std dev: 3.80 ms +- 0.12 ms
LFREERD: Mean +- std dev: 3.84 ms +- 0.17 ms
LFREERD: Mean +- std dev: 3.83 ms +- 0.15 ms
LFREERD: Mean +- std dev: 3.94 ms +- 0.12 ms
LFREERD: Mean +- std dev: 3.74 ms +- 0.12 ms
LFREERD: Mean +- std dev: 3.83 ms +- 0.19 ms
LFREERW: Mean +- std dev: 3.84 ms +- 0.19 ms
LFREERW: Mean +- std dev: 3.87 ms +- 0.17 ms
STATICL: Mean +- std dev: 3.37 ms +- 0.08 ms
STATICL: Mean +- std dev: 3.49 ms +- 0.12 ms
CLEANUP: Mean +- std dev: 3.45 ms +- 0.14 ms
AMEMCPY: Mean +- std dev: 3.44 ms +- 0.13 ms
AMEMCPY: Mean +- std dev: 3.46 ms +- 0.14 ms
LOCKWR:  Mean +- std dev: 3.52 ms +- 0.47 ms
LOCKWR:  Mean +- std dev: 3.44 ms +- 0.10 ms
AMCAGG:  Mean +- std dev: 3.38 ms +- 0.09 ms

@tom-pytel (Contributor Author) commented:

ping @colesbury

@ZeroIntensity (Member) left a comment:

Nice! Disclaimer: I'm not an expert on the FT list implementation, so take some of my comments with a grain of salt.

Seeing good single-threaded performance is nice, but what about multi-threaded scaling? The number of locks still here scares me a little. It would be nice if this scaled well for concurrent use too, especially for operations that don't require concurrent writes (e.g., comparisons and copies).

@tom-pytel (Contributor Author) commented:

Note: this is not ready to go; there is a memory issue that needs resolving.

@tom-pytel (Contributor Author) commented:

@ZeroIntensity you can remove the do-not-merge label; it's not an arraymodule issue, it's a QSBR issue, see #130794.

@tom-pytel (Contributor Author) commented:

The main thing needed for acceptance is a benchmark run, which I am not able to start myself (I only did a local pyperformance check against main), so someone with access will have to initiate that comparison with main.

@colesbury self-requested a review March 4, 2025 16:22
@colesbury (Contributor) left a comment:

I haven't gotten a chance to look through arraymodule.c yet. I'll review that later this week.

@colesbury (Contributor) left a comment:

The overall approach here seems good. A few comments below.

@tom-pytel (Contributor Author) commented Mar 5, 2025:

The actual arraymodule cleanup is fine; there is just one thing you should pay attention to when looking at this:

data->items is not a dereference but pointer arithmetic, so it is safe to use even when data is NULL in memory copies and _PyBytes_Repeat with an associated size of 0. But there are places where this might not be safe, which are accommodated (a minimal sketch of the guard pattern follows the list below), specifically:

  • array_array_buffer_info_impl
  • array_array_tobytes_impl
  • array_array_fromunicode_impl
  • array_array_tounicode_impl
  • array_buffer_getbuf_lock_held

Are there any other places where this needs to take place?
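
As a minimal sketch of the guard pattern those functions need (an illustration, not the PR's actual code): when the size is 0 the item pointer may be NULL, so avoid handing it to callees that don't tolerate NULL:

static PyObject *
tobytes_sketch(arrayobject *self)
{
    Py_ssize_t nbytes = Py_SIZE(self) * self->ob_descr->itemsize;
    if (nbytes == 0) {
        /* Empty array: don't pass a NULL-derived item pointer along. */
        return PyBytes_FromStringAndSize(NULL, 0);
    }
    return PyBytes_FromStringAndSize(self->ob_item, nbytes);
}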

It's the test, and trying to run it with --parallel-threads under TSAN, that's the problematic part.

  • warnings doesn't play nice with --parallel-threads; I found a least-evil solution, but it's not wonderful. This still reports "env changed" but at least it can run. Is there a way to check for --parallel-threads other than this?
    def setUp(self):
        if (not support.Py_GIL_DISABLED or
                any('"parallel_threads": null' in a for a in sys.argv) or
                all('parallel_threads' not in a for a in sys.argv)):
            self.enterContext(warnings.catch_warnings())
  • --parallel-threads doesn't play well with the threaded FreeThreadingTest; I sometimes get dangling threads.

  • ./Include/cpython/pyatomic_gcc.h:615:3: warning: ‘atomic_thread_fence’ is not supported with ‘-fsanitize=thread’ [-Wtsan]
    Do we care?

  • And worst of all: I cleaned up the data races in arraymodule (as far as I can see). I used to be able to run tests with suppressions set without problems, but with test_array I get:

$ ./python -m test --parallel-threads=4 -j4 test_array
Using random seed: 2354862211
0:00:00 load avg: 0.55 Run 1 test in parallel using 1 worker process
0:00:30 load avg: 0.80 running (1): test_array (30.1 sec)
0:00:41 load avg: 0.75 [1/1/1] test_array worker non-zero exit code (Exit code 66)
==================

WARNING: ThreadSanitizer: data race (pid=1075807)
  Read of size 8 at 0x7f58e22c0098 by thread T4:
    #0 _Py_TYPE Include/object.h:268 (python+0x22a21a)
    #1 Py_IS_TYPE Include/object.h:291 (python+0x22a21a)
    #2 compare_unicode_unicode_threadsafe Objects/dictobject.c:1413 (python+0x22a21a)
    #3 do_lookup Objects/dictobject.c:1010 (python+0x22bb38)
    #4 unicodekeys_lookup_unicode_threadsafe Objects/dictobject.c:1438 (python+0x22bd9b)
    #5 _Py_dict_lookup_threadsafe Objects/dictobject.c:1493 (python+0x232e50)
    #6 _PyDict_GetItemRef_KnownHash Objects/dictobject.c:2342 (python+0x233f39)
    #7 PyDict_GetItemRef Objects/dictobject.c:2378 (python+0x234070)
    #8 cache_struct_converter Modules/_struct.c:2524 (_struct.cpython-314td-x86_64-linux-gnu.so+0xc916)
    #9 calcsize Modules/clinic/_struct.c.h:249 (_struct.cpython-314td-x86_64-linux-gnu.so+0xceb0)
    #10 cfunction_vectorcall_O Objects/methodobject.c:523 (python+0x25961c)
    #11 _PyObject_VectorcallTstate Include/internal/pycore_call.h:167 (python+0x1896b3)
    #12 PyObject_Vectorcall Objects/call.c:327 (python+0x189810)
    #13 _PyEval_EvalFrameDefault Python/generated_cases.c.h:1371 (python+0x42310d)
    ...

  Previous write of size 8 at 0x7f58e22c0098 by thread T1:
    #0 memset ../../../../src/libsanitizer/sanitizer_common/sanitizer_common_interceptors.inc:799 (libtsan.so.0+0x614cb)
    #1 memset ../../../../src/libsanitizer/sanitizer_common/sanitizer_common_interceptors.inc:797 (libtsan.so.0+0x614cb)
    #2 memset /usr/include/x86_64-linux-gnu/bits/string_fortified.h:59 (python+0x274169)
    #3 fill_mem_debug Objects/obmalloc.c:2739 (python+0x274169)
    #4 _PyMem_DebugRawAlloc Objects/obmalloc.c:2823 (python+0x27429d)
    #5 _PyMem_DebugRawMalloc Objects/obmalloc.c:2839 (python+0x2742d6)
    #6 _PyMem_DebugMalloc Objects/obmalloc.c:3004 (python+0x274331)
    #7 PyObject_Malloc Objects/obmalloc.c:1412 (python+0x296e92)
    #8 PyUnicode_New Objects/unicodeobject.c:1407 (python+0x31fd7d)
    #9 PyUnicode_Concat Objects/unicodeobject.c:11647 (python+0x335e53)
    #10 PyNumber_Add Objects/abstract.c:1135 (python+0x14e11c)
    #11 _PyEval_EvalFrameDefault Python/generated_cases.c.h:62 (python+0x41b60b)
    ...

WARNING: ThreadSanitizer: data race (pid=1079529)
  Write of size 8 at 0x7f0050cbd0f0 by thread T17:
    #0 descr_get_qualname Objects/descrobject.c:624 (python+0x1ac687)
    #1 getset_get Objects/descrobject.c:193 (python+0x1aaf41)
    #2 _PyObject_GenericGetAttrWithDict Objects/object.c:1694 (python+0x268a4e)
    ...

  Previous write of size 8 at 0x7f0050cbd0f0 by thread T20:
    #0 descr_get_qualname Objects/descrobject.c:624 (python+0x1ac687)
    #1 getset_get Objects/descrobject.c:193 (python+0x1aaf41)
    #2 _PyObject_GenericGetAttrWithDict Objects/object.c:1694 (python+0x268a4e)
    ...

These are not arraymodule issues and are not suppressed. So as of now it is not runnable clean under TSAN (at least on my system, the way I am running it).

I left the bad setUp() in test_array for now so you can see it, but it will be removed (or fixed).

@colesbury (Contributor) commented Mar 5, 2025:

I'd like test_array added to TSAN_TESTS, not TSAN_PARALLEL_TESTS.

> warnings doesn't play nice with --parallel-threads

Yes, warnings is not thread-safe (even with the GIL). Neil is working on making it thread-safe, but we shouldn't add the test to TSAN_PARALLEL_TESTS for now. It's not worth trying to work around the warnings issues.

@tom-pytel (Contributor Author) commented:

In the meantime, is there any particular test or module that needs attention for your clean-TSAN project, or just everything?

@colesbury (Contributor) commented:

I think ctypes may be the most important right now. See:

https://github.com/python/cpython/issues?q=is%3Aissue%20state%3Aopen%20ctypes%20label%3Atopic-free-threading

I think @ZeroIntensity started working on it, but I don't think he's working on it right now.

I would really like to fix:

  • Races when using unrelated ctypes data (e.g., non-thread-safe global state)
  • Races when using ctypes data in a read-only manner (e.g., constructing multiple instances of the same type concurrently)

I am less concerned about concurrent modifications to the same ctypes instances. I don't really want to deal with that for now.

@ZeroIntensity (Member) commented:

Yeah, sorry, that's on my list of things to do. I got burnt out a little while ago and haven't thought of picking it up again. I want to do a couple of things with #129824, and then I'll get back to fixing ctypes in the next week or two.

@colesbury (Contributor) left a comment:

Most of the changes look good, but I am still pretty unsure about changing the item load and store code to use the FT_ATOMIC_ macros. I think we may be opening a can of worms with all of the different data types.

I think we may want to stick with normal pointer loads and stores. If someone concurrently modifies the same array from multiple threads, that will be a data race and UB, but the mixing of data types here is also technically UB.
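
To make the data-type concern concrete, here is a rough illustration (not code from this PR) of what a per-typecode atomic accessor would entail; there is no _Py_atomic_load helper for floating-point types, so each would need a same-width integer load plus a memcpy, repeated for every array typecode:

static double
load_double_atomic(char *items, Py_ssize_t i)
{
#ifdef Py_GIL_DISABLED
    /* Bit-cast through a 64-bit integer load; the same boilerplate would
       be needed for 'f' and for every integer typecode and width. */
    uint64_t bits = _Py_atomic_load_uint64((uint64_t *)(items + i * sizeof(double)));
    double v;
    memcpy(&v, &bits, sizeof(v));
    return v;
#else
    return ((double *)items)[i];
#endif
}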

@colesbury (Contributor) commented:

I'm sorry for the conflicting advice I've given. It's hard to know how we should pursue this without seeing the actual code changes.

@tom-pytel (Contributor Author) commented Mar 15, 2025:

> Most of the changes look good, but I am still pretty unsure about changing the item load and store code to use the FT_ATOMIC_ macros. I think we may be opening a can of worms with all of the different data types.

Have a look at the possible memoryview format codes, and there is also numpy. They are doable, though.

> I think we may want to stick with normal pointer loads and stores. If someone concurrently modifies the same array from multiple threads, that will be a data race and UB, but the mixing of data types here is also technically UB.

IMHO you are overly worried about the phrase "undefined behavior", which in practice here really just means "undefined (stale) value"; otherwise probably no modern system would boot. Also, I have never had a problem writing an int to an unsigned int * location or vice versa, though the warnings are annoying, yes.

In any case, I changed back to non-atomic get/set. Also removed the TSAN-specific test.

@tom-pytel (Contributor Author) commented:

Small detail: do you want to leave test_array in TSAN_TESTS or remove it?

@colesbury (Contributor) left a comment:

Mostly minor comments below, but I think the alignment of items is important.

@tom-pytel (Contributor Author) commented:

Tools/c-analyzer/check-c-globals.py may be a little unstable. It complained about:

#ifdef _MSC_VER

typedef __declspec(align(8)) struct {
    Py_ssize_t allocated;
    char items[];
} arraydata;

#else

typedef struct {
    Py_ssize_t allocated;
    _Alignas(8) char items[];
} arraydata;

#endif

And:

#ifdef _MSC_VER
#define STRUCT_ALIGN __declspec(align(8))
#define ATTR_ALIGN
#else
#define STRUCT_ALIGN
#define ATTR_ALIGN _Alignas(8)
#endif

typedef STRUCT_ALIGN struct {
    Py_ssize_t allocated;
    ATTR_ALIGN char items[];
} arraydata;

for arraymodule.c, leading to what is there now.

@colesbury (Contributor) commented:

Use something like: https://gist.github.com/colesbury/96f27e2ddf6b151adeeb4c28ed7554d8

The alignment specifier has to be on char items[] because we are putting things that are not char there. It's not useful for it to be on the arraydata struct.
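
For reference, one possible shape with the alignment on the member for both compilers. This is a sketch only; whether MSVC accepts the member-level __declspec on a flexible array member would need checking, and the linked gist may do it differently:

#ifdef _MSC_VER
#define ALIGN8 __declspec(align(8))
#else
#define ALIGN8 _Alignas(8)
#endif

typedef struct {
    Py_ssize_t allocated;
    ALIGN8 char items[];
} arraydata;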

@tom-pytel (Contributor Author) commented:

Just a minor nit: "MS_WINDOWS" or "_MSC_VER"? The latter is used in the codebase for MSVC-specific directives, and you can run gcc under Windows.

@colesbury (Contributor) commented:

pyport.h uses _MSC_VER for the definition of _Py_thread_local as __declspec(thread), so I guess _MSC_VER.

@tom-pytel (Contributor Author) commented:

Excessive QSBR memory usage: I ran across this while profiling memory usage here. The results are similar for both array and list objects, which both use QSBR to free memory, so this is a QSBR thing.

Memory usage numbers for the script (provided below) using both array and list (attachproc is explained below):

VmHWM             array         list
GIL           121140 kB    143796 kB  - normal GIL-enabled baseline
noGIL        6561100 kB   6651608 kB  - free-threaded current QSBR behavior
attachproc    412164 kB    477312 kB  - free-threaded with _PyMem_ProcessDelayed() in _PyThreadState_Attach()

Script:

import threading
from array import array
from queue import Queue

def thrdfunc(queue):
    while True:
        l = queue.get()

        l.append(0)  # force resize in non-parent thread which will free using _PyMem_ProcessDelayed()

queue = Queue(maxsize=2)

threading.Thread(target=thrdfunc, args=(queue,)).start()

while True:
    l = array('i', [0] * int(3840*2160*3/4))  # using int instead of byte for reasons
    # l = [None] * int(3840*2160*3/8)  # sys.getsizeof(l) ~= 3840*2160*3 bytes

    queue.put(l)

Delayed-memory free checks (and the subsequent frees, if applicable) occur in one of two situations:

  • Garbage collection, which doesn't trigger often enough in this script (a manual trigger does solve the problem).
  • A delayed-free operation when the number of delayed memory blocks to free has reached 254, which is a lot for big buffers.

This works great for many small objects, but larger buffers can accumulate quickly. I tried a few things; the attachproc numbers above are for a _PyMem_ProcessDelayed() call added to the end of _PyThreadState_Attach(), which has worked the best so far.

diff --git a/Python/pystate.c b/Python/pystate.c
index ee35f0fa945..d9d731a15bc 100644
--- a/Python/pystate.c
+++ b/Python/pystate.c
@@ -2169,6 +2169,9 @@ _PyThreadState_Attach(PyThreadState *tstate)
 #if defined(Py_DEBUG)
     errno = err;
 #endif
+#ifdef Py_GIL_DISABLED
+    _PyMem_ProcessDelayed(tstate);
+#endif
 }

 static void

I'm not saying this is THE solution, but at the very least it shows memory usage can be reduced with no hit to performance (I timed it with the pyperformance suite vs. normal nogil Python and the difference is a quarter of a percent, basically variance).

Another option would be to add a check in free_delayed() that decides when to run _PyMem_ProcessDelayed() based on the total amount of memory deferred for freeing (which can be queried per pointer from mimalloc); a rough sketch follows.
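
A rough sketch of that byte-budget idea (hypothetical names and a placeholder threshold; the real bookkeeping would live next to free_delayed() in Objects/obmalloc.c):

#define QSBR_DEFERRED_BYTES_BUDGET (16 * 1024 * 1024)   /* placeholder value */

static void
note_deferred_free(PyThreadState *tstate, void *ptr)
{
    PyInterpreterState *interp = tstate->interp;
    /* mi_usable_size() reports the usable size of a mimalloc allocation;
       'deferred_free_bytes' is an invented per-interpreter counter. */
    Py_ssize_t nbytes = (Py_ssize_t)mi_usable_size(ptr);
    Py_ssize_t total = _Py_atomic_add_ssize(&interp->deferred_free_bytes, nbytes) + nbytes;
    if (total > QSBR_DEFERRED_BYTES_BUDGET) {
        _Py_atomic_store_ssize(&interp->deferred_free_bytes, 0);
        _PyMem_ProcessDelayed(tstate);
    }
}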

Thoughts?
