nogil violation of atomicity on set operations #126136

luisggpina · 2024-10-29T18:12:13Z

Bug report

Bug description:

Hi,

We're a research group focused on testing concurrent runtimes. Our work-in-progress prototype found a violation of atomicity on the current nogil build when using concurrent operations on the same set. The program below shows the wrong behavior.

from threading import Thread,Barrier

def t1(b1,s):
    b1.wait()
    s.intersection_update({1,5})

def t2(b1,s):
    b1.wait()
    s.symmetric_difference_update({1, 8, 7, 5, 9})

def normalSetTest(i):

    s = {1,2,3,4,5,6}
    setCombos = [{7,8,9}, set()] # all possible results
    barrier = Barrier(2)
    threads = [ Thread(target= t1, args=(barrier,s,)), Thread(target= t2, args=(barrier,s,)) ]


    for t in threads:
        t.start()
    for t in threads:
        t.join()
    if s not in setCombos:
        print("\tfound bug on iter " + str(i) + ": " + str(s))


print("test begin...")

for n in range(0,2000):
    threads = []

    if n % 100 == 0:
        print(n)

    for i in range(0,100):
        threads.append(Thread(target= normalSetTest, args=(n,)))

    for t in threads:
        t.start()

    for t in threads:
        t.join()

print("test done")

A set starts with {1,2,3,4,5,6} and two threads each make a concurrent operation on it: intersection_update({1,5}) and symmetric_difference_update({1, 8, 7, 5, 9}).

Depending on the order of execution, the possible results are either {7,8,9} or {} (the empty set).

Running the test harness above shows that the (incorrect) result {1, 5} is sometimes possible.

If you have trouble observing this behavior, feel free to increase the first range's last argument to a large integer (e.g., from 2000 to 10000).

We replicated this bug on 3 different machines with varying numbers of cores; and we were able to find it even using a single pair of threads (i.e., the second range's argument from 100 to 1). The current harness uses 100 pairs of threads operating on unrelated sets to speed up bug discovery.

We're happy to provide more details about this bug, and to help developers reproducing it.

Output of python -VV: Python 3.14.0a1+ experimental free-threading build (heads/main:faa3272fb8d, Oct 29 2024, 09:14:25) [GCC 14.2.1 20240805]

Example output:

test begin...
0
100
200
300
        found bug on iter 346: {1, 5}
400
500
600
700
800
900
1000
1100
1200
1300
1400
1500
1600
        found bug on iter 1678: {1, 5}
1700
1800
        found bug on iter 1864: {1, 5}
        found bug on iter 1882: {1, 5}
1900
test Done

@flypoodles and @overlorde are part of the team, adding them so they get notified about further discussion.

CPython versions tested on:

3.13, 3.14, CPython main branch

Operating systems tested on:

Linux

The text was updated successfully, but these errors were encountered:

ZeroIntensity · 2024-10-29T20:11:25Z

Well, the free threading docs explicitly note that you shouldn't rely on the internal locks for thread safety--if it does work at all, it's considered an implementation detail, not a full feature. The internal locks are really only for thread-safe reference counting, they don't have any guarantees about what happens to the object when using it concurrently.

That said, maybe this is a bug with concurrency on set somehow, depending on what's going on. Does this still occur when using a proper threading.Lock?

colesbury · 2024-10-29T20:20:21Z

I'm not sure how we want to deal with this yet, but I think these reports are useful regardless of whether the behavior constitutes a "bug" or not.

luisggpina · 2024-10-30T19:44:42Z

Thank you for your insight into this behavior.

We're very excited about the work on removing the GIL and want to help in finding weird combinations of operations that lead to unexpected results. Such combinations are very hard to find by manual inspection of the code, but we have a tool (under development) that can find some of them and even generate nice testing harnesses like the one I shared.

We're guided by the text in the original PEP, in particular this passage:

This PEP proposes using per-object locks to provide many of the same protections that the GIL provides. For example, every list, dictionary, and set will have an associated lightweight lock. All operations that modify the object must hold the object's lock.

The text suggests that the methods in our report should hold a per-object lock (on the set itself) for their whole duration such that they execute atomically as far as the set being destructively mutated is concerned.

Of course, the PEP states that the locks offer less protection than the GIL, but still retaining mutual exclusion for methods operating on the same object:

Instead, per-object locking aims for similar protections as the GIL, but with mutual exclusion limited to individual objects.

So perhaps weird results due to the argument being passed to each method being concurrently mutated are allowed. We are not looking for those, only for how the set itself behaves.

Anyway, I understand that @colesbury is the author of the PEP and it seems that wether this behavior should be allowable is still being considered. Please don't read my post as being pedantic on the verbiage used in the PEP. Apologies if I'm not understanding the PEP correctly, even after reading it carefully a few times; but it seems to me that this behavior should not be allowed.

luisggpina added the type-bug An unexpected behavior, bug, or error label Oct 29, 2024

picnixz added topic-free-threading interpreter-core (Objects, Python, Grammar, and Parser dirs) labels Oct 29, 2024

ZeroIntensity added 3.13 bugs and security fixes 3.14 new features, bugs and security fixes labels Oct 29, 2024

luisggpina mentioned this issue Nov 7, 2024

nogil inconsistent empty list while sorting #126559

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

nogil violation of atomicity on set operations #126136

nogil violation of atomicity on set operations #126136

luisggpina commented Oct 29, 2024 •

edited

Loading

ZeroIntensity commented Oct 29, 2024

colesbury commented Oct 29, 2024

luisggpina commented Oct 30, 2024

nogil violation of atomicity on set operations #126136

nogil violation of atomicity on set operations #126136

Comments

luisggpina commented Oct 29, 2024 • edited Loading

Bug report

Bug description:

CPython versions tested on:

Operating systems tested on:

ZeroIntensity commented Oct 29, 2024

colesbury commented Oct 29, 2024

luisggpina commented Oct 30, 2024

luisggpina commented Oct 29, 2024 •

edited

Loading