Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix logic error and race condition #8

Merged
merged 1 commit into from
Apr 14, 2024
Merged

fix logic error and race condition #8

merged 1 commit into from
Apr 14, 2024

Conversation

zerbina
Copy link
Contributor

@zerbina zerbina commented Apr 13, 2024

There were both a race condition and logic error with the implementation, which could lead to spurious signals, crashes, and other undefined behaviour.

Race Condition

Thread 1:
...

  1. suspend: acquires lock for actor A and pool
  2. changes the state to Suspended
  3. suspend: releases the locks
  4. pauses

Thread 2:
....

  1. resume: acquires lock for actor A and pool
  2. actor state is Suspended, so the actor is queued
  3. resume: releases the locks
  4. ...
  5. actor A is popped from the work queue and resumes execution

Both thread 1 and 2 now have unguarded concurrent ownership of actor A, which constitutes a race condition.

Logic Error

Note: this is only one example of what could happen.

Thread 1:
...

  1. suspend: acquires lock for actor A and pool
  2. changes the state to Suspended
  3. releases the lock for actor A and and pool
  4. pauses

Thread 2:
...

  1. resume: acquires lock for actor A and pool
  2. actor state is Suspend, so the actor is queued
  3. ...
  4. actor A resumes execution
  5. jield is called, and the state of actor A changes to Jielding
  6. ...
  7. workerThread: fetches the state of actor A (Jielding)
  8. pauses

Thread 1:

  1. workerThread: fetches the state of actor A (Jielding)
  2. queues actor A for execution and changes the state to Queued
  3. pauses

Thread 2:

  1. the local state variable still contains Jielding, so the actor is queued again

The same actor is now in the work queue two times!

The Solution

In both cases, the problem is that another thread can assume ownership of an actor while the worker thread responsible for the actor hasn't finished suspending it, due to the actor state being eagerly set to Suspended by suspend.

suspend now sets the state to the new Suspending state first. When handling the Suspending state, if new signals were received in the meantime (sendSig cannot resume a suspending actor), the actor is queued again right away. Otherwise, it's suspended.

There were both a race condition and logic error with the
implementation, which could lead to spurious signals, crashes, and
other undefined behaviour.

Race Condition
--------------

Thread 1:
...
1. `suspend`: acquires lock for actor `A` and pool
2. changes the state to `Suspended`
3. `suspend`: releases the locks
4. pauses

Thread 2:
....
1. `resume`: acquires lock for actor `A` and pool
2. actor state is `Suspended`, so the actor is queued
3. `resume`: releases the locks
4. ...
5. actor `A` is popped from the work queue and resumes execution

Both thread 1 and 2 now have unguarded concurrent ownership of actor
`A`, which constitutes a race condition.

Logic Error
-----------

Note: this is only *one* example of what could happen.

Thread 1:
...
1. `suspend`: acquires lock for actor `A` and pool
2. changes the state to `Suspended`
3. releases the lock for actor `A` and and pool
4. pauses

Thread 2:
...
1. `resume`: acquires lock for actor `A` and pool
2. actor state is `Suspend`, so the actor is queued
3. ...
4. actor `A` resumes execution
5. `jield` is called, and the state of actor `A` changes to `Jielding`
6. ...
7. `workerThread`: fetches the state of actor `A` (`Jielding`)
8. pauses

Thread 1:
1. `workerThread`: fetches the state of actor `A` (`Jielding`)
2. queues actor `A` for execution and changes the state to `Queued`
3. pauses

Thread 2:
1. the local `state` variable still contains `Jielding`, so the actor
   is queued again

The same actor is now in the work queue two times!

The Solution
------------

In both cases, the problem is that another thread can assume ownership
of an actor while the worker thread responsible for the actor hasn't
finished suspending it, due to the actor state being eagerly set to
`Suspended` by `suspend`.

`suspend` now sets the state to the new `Suspending` state first. When
handling the `Suspending` state, if new signals were received in the
meantime (`sendSig` cannot resume a *suspending* actor), the actor is
queued again right away. Otherwise, it's suspended.
@zevv
Copy link
Owner

zevv commented Apr 14, 2024

Thanks for this!

@zevv zevv merged commit 94e9831 into zevv:master Apr 14, 2024
1 of 3 checks passed
@zerbina zerbina deleted the fix-pool-bug branch April 15, 2024 00:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants