Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Condition/RecursiveLock: add ability to handle threads #30061

Merged
merged 1 commit into from
Dec 18, 2018
Merged

Conversation

vtjnash
Copy link
Member

@vtjnash vtjnash commented Nov 16, 2018

This extends Condition and RecursiveLock to assert that they may only be used in the single-threaded case (co-operatively scheduled), and then adds thread-safe versions of the same (ConditionMT and RecursiveLockMT). Additionally, the constructor for ConditionMT lets you pass in an existing RecursiveLockMT to allow having multiple notifications off of a single lock.

Unlike the existing thread-safe primitives (Threads.AsyncCondition, Threads.SpinLock, Threads.Mutex), these new types integrate with the Task system also, and thus should typically be preferred to those for user code. Unlike the existing task-only primitives, these work in all situations, but require the user to write explicit lock code annotating the protected code regions.

In version 2.0, we can consider switching the default for Condition from ConditionST to ConditionMT, but that would be a breaking change. (ConditionMT requires the user to hold a lock before calling wait, while that lock is currently implicit in Condition under the assumption of cooperative tasking—as explicitly represented in this PR by the NotALock type.)

Implement and close #30026

Copy link
Member

@StefanKarpinski StefanKarpinski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This *MT naming convention seems a bit unfortunate. Is it correct to use Condition when Julia is multithreaded? Or ConditionMT when Julia is single-threaded?

base/event.jl Outdated Show resolved Hide resolved
base/event.jl Outdated Show resolved Hide resolved
@ararslan ararslan added the multithreading Base.Threads and related functionality label Nov 16, 2018
@JeffBezanson
Copy link
Member

This *MT naming convention seems a bit unfortunate. Is it correct to use Condition when Julia is multithreaded? Or ConditionMT when Julia is single-threaded?

I agree --- code should be independent of the number of threads as much as possible. I think these can be seen as different ways of using conditions, rather than different kinds of conditions: in one case, you just want to be woken up and don't need to test a predicate, and in the other case you need to test a predicate and so need to continue holding a lock when wait returns. Maybe that could be an option passed to wait. Another way to look at it is that code that waits and then tests a predicate without holding the lock is not thread-safe and may need to change, but that is local to the caller of wait and doesn't need to affect the code that allocates the Condition.

@vtjnash
Copy link
Member Author

vtjnash commented Nov 18, 2018

This *MT naming convention seems a bit unfortunate. Is it correct to use Condition when Julia is multithreaded? Or ConditionMT when Julia is single-threaded?

It's a statement of the algorithm being used (whether concurrent modifications are supported), not the configuration of Julia. You can always use a thread-safe variant of any object (hence in 2.0, we might decide to make that the default) by just accepting the performance hit, but the ability to avoid the extra overhead of a lock while using the same code might remain desirable.

code should be independent of the number of threads as much as possible.

The only way to do that is to make everything very slow. I strongly disagree with this.

that is local to the caller of wait and doesn't need to affect the code that allocates the Condition.

I don't think a caller can be local, since something external ("the caller" could be anyone on the callstack) is, by definition, non-local to the callee.

doesn't need to affect the code that allocates the Condition.

Um, that's what C does, and is precisely what we just said we don't want to do (#30026)

@JeffBezanson
Copy link
Member

No, what I'm saying is that we should try to skip ahead to making all of these thread-safe by default. I think it might be possible to arrange the API to enable that. For example, Condition() allocates a lock for itself, and wait(::Condition) acquires it first. Then the only remaining difference is whether the code calling wait expects to be holding a lock when it returns. For backwards compatibility, wait would have to release the lock. Then code that tests a predicate after wait, and that wants to become thread-safe, needs to call e.g. wait(c, locked=true) instead. That seems like an easier upgrade path to me than changing Condition to ConditionMT and adding a lock acquire and release to each use.

I feel more strongly about not having, or strongly discouraging using, a thread-unsafe lock. That seems like a logical place to be thread-safe by default. If it's truly necessary to have a task-only lock for performance tuning, that can be added later when needed and hidden away somewhere, possibly not even exported.

@vtjnash
Copy link
Member Author

vtjnash commented Nov 19, 2018

No. I don’t think you understand what’s going on here. I’ve been thinking about this a lot for the past couple months, and haven’t been able to find a substitute (sadly) and hence am finally turning this into a PR.

If you aren’t already holding the lock when you call ‘wait’, your code is broken. It’s absolutely useless to break the API just to return with the lock held (I do it here because it’s convenient and makes the implementation a bit more powerful around RecursiveLock—but it’s absolutely just optional code).

The options I’m aware of are: (a) break backwards compatibility and require you to hold the lock first (b) declare in the constructor whether correct locking is mandatory (this PR) or (c) declare / detect in the wait call whether you previously acquired the lock, and branch to the appropriate implementation at runtime

@JeffBezanson
Copy link
Member

(c) declare / detect in the wait call whether you previously acquired the lock, and branch to the appropriate implementation at runtime

I think that's fairly close to what I'm proposing.

Condition is tricky, but I think we can at least get away with having only the thread-safe flavor of lock. Will be nice not to have both lock.jl and locks.jl...

@vtjnash
Copy link
Member Author

vtjnash commented Nov 19, 2018

Yes, I was actively testing the branch with the switch to change the default lock type to MT-safe, and that seems to build and work. Since those have the same API, your proposal seems like an obvious improvement.

I could switch Condition to ConditionMT, but (for backwards compatibility), that requires removing the concurrency violation errors in some places, and replacing them with some logic to bypass thread-safety instead when we detect that case. It's not my favorite option (Since I like that with the current state of the PR, the user states. via the choice of constructor, whether their code is updated to be thread-safe. And I think it's a bit easier to assert than to generally handle this case.). However, I don't foresee any actual blockers in getting the test-suite to pass with that option. (Implementation-wise, this might mean adding a flag to RecursiveLock of whether to disable error checking, or perhaps just handling it specially inside notify and wait)

@vtjnash
Copy link
Member Author

vtjnash commented Nov 20, 2018

Actually, in talking with Kiran today, there actually is another option (d). In this PR, I'm using the word Condition in the same sense as pthreads/C/Java/TBB/Winnt/etc. of a queue of threads waiting for an event. But we could also use the word in the sense of the simple notification trigger itself, or what might be called an Event in some other contexts (such as kevent/epoll/Winnt/Julia).

This would be an intermediate approach that alters some results but doesn't outright break all usage. In exchange, we might be able to gain partial MT-safety for the existing Condition object. In this option, we would instead deprecate Event and upgrade Condition to implement a notification trigger. Similar to kevent and epoll, it would be configurable as both a level-triggered (was Event) and edge-triggered (was Condition) variant (via a mutable runtime flag).

The trade-off here (pros and cons, in no particular order) is that

  • the first wake-up might now be spurious (and/or we might be able to provide an API to clear a stale event trigger), which might be considered breaking anyways;
  • it is not MT-safe for multiple simultaneous waiting threads (some threads may miss the wakeup notification), although it would sometimes be safe for the notify wakeup to happen asynchronously;
  • it seems like this might be a bit less orthogonal to the other APIs (it seems to do a little bit of everything—blocking Channels, EventMT, notify, atomics);
  • it might be faster someday than the MT-safe ConditionMT (since we might someday be able to avoid taking a lock, and that might save an atomic operation);
  • it also may be better representation of a repeated event such as UDP traffic, ZMQ pub/sub, or fast interval timers, and other similar cases where we don't care about missing some event notifications.

@vtjnash vtjnash added the DO NOT MERGE Do not merge this PR! label Nov 27, 2018
@vtjnash
Copy link
Member Author

vtjnash commented Nov 27, 2018

I've now added another commit demonstrating the ability for the Condition type to branch at runtime to ST or MT mode, based on whether we specify in the constructor that it requires—and thus will obey—the additional stipulations of the thread-safe API. (per off-line conversations with Jeff et al.)

As stated in the commit description, this is a fully functional prototype, but not necessarily a final implementation. As such, it still leaves in the machinery to define, use, and experiment with the *ST and *MT aliases. Once we settle on a design, I'll go back through and remove any content and flexibility that we didn't end up wanting.

@NHDaly
Copy link
Member

NHDaly commented Dec 5, 2018

We talked some last week about other options besides what is currently implemented in this PR. Is what we discussed then basically equivalent to option (c) above?

If so, if I remember correctly, the tradeoff was essentially:
PRO: automatically upgrading code in-place to be thread-safe wrt to the Condition's internals (even if it's leaving the surrounding code potentially thread un-safe)
CON: lose the ability to throw assertions in code that is not yet upgraded for thread safety.

Is that more or less right? If so, it seems to me that this tradeoff is not worth it.

Am I characterizing the discussion correctly? I'm sorry that I don't remember all of the details. Did we come to a consensus that I'm not remembering?

@vtjnash vtjnash added the triage This should be discussed on a triage call label Dec 6, 2018
@JeffBezanson
Copy link
Member

@vtjnash brought up the point that in the future there might be multi-threaded and single-threaded versions of various things, like HashMap vs. ConcurrentHashMap. So we might want to think about this from the perspective of having a general naming convention, which we can then also use here. Some possibilities:

  • Condition, ConditionMT
  • Condition, Threads.Condition
  • Condition, ConcurrentCondition

?

@vtjnash
Copy link
Member Author

vtjnash commented Dec 7, 2018

Talking more with Jeff yesterday, we proposed the following actions:

  • deprecate Condition (silently for now, removal targeted for v2.0)
  • introduce Threads.Condition with thread-safe API (name TBD, as mentioned above)
  • implement the behavior of the current Condition as an option for Event (future work, since we aren't using either right now)
  • everything else here (ReentrantLock, Channel, Event) to become thread-safe by default

@NHDaly
Copy link
Member

NHDaly commented Dec 7, 2018

+10 I think that all sounds really great to me!

I agree that the current behavior of Condition belongs with Event, and that nicely solves the naming problems. +1 and thanks for thinking hard about this! :)

@JeffBezanson
Copy link
Member

Another option for resolving the Condition name conflict is to use more of the full term "condition variable", e.g. ConditionVariable, ConditionVar, CondVar, ...

@vtjnash vtjnash force-pushed the jn/condition-mt branch 4 times, most recently from 324c2d5 to 06caab5 Compare December 11, 2018 19:12
@vtjnash vtjnash removed DO NOT MERGE Do not merge this PR! triage This should be discussed on a triage call labels Dec 11, 2018
@JeffBezanson
Copy link
Member

👍 Looks good. We'll have to wait for the 1.1 news to be moved aside then rebase.

base/event.jl Outdated Show resolved Hide resolved
base/event.jl Show resolved Hide resolved
@NHDaly
Copy link
Member

NHDaly commented Dec 13, 2018

Yay! Thanks for the hard work on this, Jameson! It looks really good! :))

@vtjnash vtjnash force-pushed the jn/condition-mt branch 3 times, most recently from b0c2e18 to a6ea3ef Compare December 17, 2018 02:48
This extends Condition to assert that it may only be used
in the single-threaded case (co-operatively scheduled),
and then adds a thread-safe version of the same:
Threads.Condition.

Additionally, it also upgrades ReentrantLock, etc. to be thread-safe.
assert_havelock(l::AbstractLock, tid::Nothing) = error("concurrency violation detected")

"""
AlwaysLockedST
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to be the only case where the original *ST and *MT naming has remained in the merged PR. Was that intentional or was this just overlooked?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, there aren't any MT and this type is only meant to be used internally. We could rename it though.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, that's what I was thinking—just calling it AlwaysLocked would be good.

ararslan added a commit to invenia/RemoteSemaphores.jl that referenced this pull request Feb 19, 2019
In Julia 1.2, the error that occurs when releasing a semaphore with
mismatched release/acquire counts is now an ErrorException, whereas
previously it was an AssertionError.

See JuliaLang/julia#30061
ararslan added a commit to invenia/RemoteSemaphores.jl that referenced this pull request Feb 19, 2019
In Julia 1.2, the error that occurs when releasing a semaphore with
mismatched release/acquire counts is now an ErrorException, whereas
previously it was an AssertionError.

See JuliaLang/julia#30061
@KristofferC KristofferC mentioned this pull request May 15, 2019
58 tasks
@NHDaly
Copy link
Member

NHDaly commented Sep 4, 2019

@vtjnash I wonder if it was a mistake to allow the old-style, non-thread-safe Conditions to be locked/unlocked.

This makes it somewhat easy to hit the following gotcha (which happened to me today):

julia> c = Condition()
Base.GenericCondition{Base.AlwaysLockedST}(Base.InvasiveLinkedList{Task}(nothing, nothing), Base.AlwaysLockedST(1))

julia> lock(c)

julia> Threads.@spawn begin lock(c) ;@info "HI" ;unlock(c) end
[ Info: HI
Task (done) @0x000000011ce3ead0

I forgot that I was using Condition instead of Threads.Condition in my code, so i was seeing a race-condition I didn't expect. It took me a while to finally narrow it down to the above MWE, and that's when I finally remembered that there is a new Threads.Condition I'm supposed to be using, and that lock(::Condition) is a no-op.

I think it would be better to have lock(::Condition) throw an error, rather than be a no-op, because there is simply no good reason to use the old-style Conditions with locks. If you're using locks, you almost certainly mean to be using the new, thread-safe, Threads.Condition.

Can we throw an error to enforce that, so that users aren't caught by this surprise like I was? :) What do you think? Thanks! :)

@JeffBezanson
Copy link
Member

I believe Condition throws an error if another thread tries to use it. What was the race condition?

@NHDaly
Copy link
Member

NHDaly commented Sep 4, 2019

It seems that the base Condition does not throw an error, as shown in the code I posted above. That right there is a race-condition, because two threads are proceeding at the same time when I thought they would be locked. And it didn't throw an error like I would want it to.

The full situation I was in was trying to support Conditions in https://github.com/NHDaly/Select.jl (this is how I discovered the second problem in NHDaly/Select.jl#2).

When I was experimenting with it, i was accidentally using the wrong kind of Condition and didn't notice, and I was really confused by the behavior. I managed to finally figure it out by adding print-statements to this toy example:

julia> using Test; using Base.Threads: @spawn

julia> @sync begin
           coordinator = Channel()

           c1 = Condition()
           c2 = Condition()

           wait_on_conds() = begin
               @info "lock(c1)"
               lock(c1)
               @info "lock(c2)"
               lock(c2)
               put!(coordinator, 0)  # Notify main task that we're waiting. (Note that it will block on lock until we're actually asleep.)
               @async begin wait(c1); @info "c1" end
               @async begin wait(c2); @info "c2" end
               unlock(c2)
               unlock(c1)
           end

           t1 = @async wait_on_conds()
           t2 = @async wait_on_conds()

           take!(coordinator); take!(coordinator)  # Wait for both tasks to start waiting

           # Now, everyone is ready, so notifying c1 will wake up _both_ tasks
           lock(c1)
           @test notify(c1) == 2   # ERROR: This returns `1` instead, because the `lock` doesn't actually block until `t2` has waited like I expected.
           unlock(c1)
       end

(In-fact, I finally realized that this test wasn't going to work anyway, because wait(c) on the new task doesn't release the lock on the parent task, so this wouldn't work. (And if you use actual Threads.Conditions here, you get an error about it.) But in this case, I didn't notice that problem because the Base.Conditions hid the error, and my code proceeds to the @test before both tasks have started waiting, which is a race-condition_.)

@JeffBezanson
Copy link
Member

I get

julia> c = Condition()
Base.GenericCondition{Base.AlwaysLockedST}(Base.InvasiveLinkedList{Task}(nothing, nothing), Base.AlwaysLockedST(1))

julia> lock(c)

julia> Threads.@spawn begin lock(c) ;@info "HI" ;unlock(c) end
Task (failed) @0x00007f56fe78e710
concurrency violation detected
error(::String) at ./error.jl:33
concurrency_violation() at ./condition.jl:8
assert_havelock at ./condition.jl:26 [inlined]
assert_havelock at ./condition.jl:49 [inlined]
lock at ./condition.jl:50 [inlined]
lock(::Base.GenericCondition{Base.AlwaysLockedST}) at ./condition.jl:74

I suppose the issue is that Base.Condition only really exists for backwards compatibility, and it never used to require or even allow locking. So yes, I can see the case for disallowing locking on them.

@NHDaly
Copy link
Member

NHDaly commented Sep 4, 2019

Weeirrrrd. It happily continues in parallel for me on both 1.3 and my 1-day old master 1.4.

julia> versioninfo()
Julia Version 1.3.0-rc1.0
Commit 768b25f6a8* (2019-08-18 00:04 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin18.7.0)
  CPU: Intel(R) Core(TM) i9-8950HK CPU @ 2.90GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, skylake)
julia> versioninfo()
Julia Version 1.4.0-DEV.79
Commit d3250fe005* (2019-09-02 19:07 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin18.7.0)
  CPU: Intel(R) Core(TM) i9-8950HK CPU @ 2.90GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, skylake)

@NHDaly
Copy link
Member

NHDaly commented Sep 4, 2019

I suppose the issue is that Base.Condition only really exists for backwards compatibility, and it never used to require or even allow locking. So yes, I can see the case for disallowing locking on them.

But yeah, exactly. That's my thinking as well. I think having the two types can lead to surprises like this, so any amount of failing early would be much appreciated! ❤️ Erroring on lock seems like one nice way to do that, but we should do that everywhere we can.

Ideallllly, it seems like it would be nice to consider renaming Base.Condition to something like Base.SingleThreadedCondition, find/replacing all of base to use that name, and then changing the default, unqualified Condition to be the new one and just say "SORRY" to any user code which will now throw an error that it needs to be locked. It's an easy error to fix, i think, and will be a nice reminder to start thinking about thread-safety. And I'd imagine there aren't too many uses of Condition in packages, anyway.

@NHDaly
Copy link
Member

NHDaly commented Sep 4, 2019

This is kind of a corny solution, but I've opened #33162 as a straw-person

EDIT: 👍 Thanks, this was merged. :) That should address my concerns

Keno pushed a commit that referenced this pull request Jun 5, 2024
This extends Condition to assert that it may only be used
in the single-threaded case (co-operatively scheduled),
and then adds a thread-safe version of the same:
`Threads.Condition`.

Additionally, it also upgrades ReentrantLock, etc. to be thread-safe.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
multithreading Base.Threads and related functionality
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Parallelism: Should wait(::Condition) take a Mutex to unlock/lock?
5 participants