Harden Queue#take's cancelation properties #2920

Closed
djspiewak opened this issue Mar 26, 2022 · 9 comments · Fixed by #3000
@djspiewak
Member

take must have the property that, even when canceled, it either does not remove the element at all, or it removes the element and returns it. Thus the whole thing must be uncancelable aside from an atomic poll around the wait (when semantic blocking is needed). Without this property, it is possible in rare race conditions to cancel a take and lose data.

Applies to both the existing Concurrent queue and the new Async queue in #2885

@djspiewak djspiewak added this to the v3.4.0 milestone Mar 26, 2022
@durban
Contributor

durban commented Mar 31, 2022

Should the guarantee be that code like this never loses the item:

    val tsk = IO.uncancelable { poll =>
      poll(q.take).flatTap { taken =>
        // save taken item somewhere
      }
    }

So even if tsk is cancelled, it must either not remove the item from the queue, or else it must save the removed item "somewhere".

It seems that currently code like this can indeed lose the item. I think (one possible) reason is this line in take:

val cleanup = state.update { s => s.copy(takers = s.takers.filter(_ ne taker)) }

This runs when cancelling take. It simply removes the Deferred if it's present. However, if the Deferred was already removed, that case is currently ignored, and it shouldn't be. (Although I'm not sure what to do in that case... possibly this needs fixing in offer, because take may not be able to do the correct thing there.)
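
For context, here is a rough sketch of the shape take has around that cleanup. This is illustrative only, not the actual implementation: the state is simplified to a buffer plus a list of waiting taker Deferreds, and all names are made up.

import cats.effect.{Concurrent, Deferred, Ref}
import cats.effect.syntax.all._
import cats.syntax.all._

// Illustrative only: state is (buffered elements, waiting takers)
def take[F[_], A](state: Ref[F, (List[A], List[Deferred[F, A]])])(implicit F: Concurrent[F]): F[A] =
  F.uncancelable { poll =>
    F.deferred[A].flatMap { taker =>
      state.modify {
        case (a :: rest, takers) =>
          // an element is buffered: hand it over while still uncancelable
          ((rest, takers), F.pure(a))
        case (Nil, takers) =>
          // nothing buffered: register the taker and wait; only the wait is cancelable.
          // cleanup silently does nothing if an offer already removed (and completed) the
          // taker, which is exactly the window where a canceled take loses the element
          val cleanup = state.update { case (buf, ts) => (buf, ts.filterNot(_ eq taker)) }
          ((Nil, takers :+ taker), poll(taker.get).onCancel(cleanup))
      }.flatten
    }
  }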

@durban
Contributor

durban commented Mar 31, 2022

Okay, so "the whole thing must be uncancelable aside from an atomic poll" seems to be already true, if I'm reading it correctly. It is also not enough.

I think what would need to happen in this case is that when the fiber running q.offer removes the deferred, it must also, atomically with that removal, make the fiber running poll(q.take) uncancelable. I don't know if that's possible with the current API...

Or, alternatively, take should be able to ignore the cancellation if it detects that the deferred was already removed. I don't think that's possible...

@djspiewak
Member Author

Going to write some more here in a bit, but to be extremely concrete, this is the test which I'm pretty sure fails right now and which I feel we need to pass:

"never lose data" in {
  // this is a race condition test so we iterate
  val test = 0.until(10000).toList traverse_ { _ =>
    for {
      q <- Queue.bounded[IO, Int](100)
      _ <- 0.until(100).toList.traverse_(q.offer(_))

      latch <- IO.deferred[Unit]
      backR <- IO.ref(Vector[Int]())

      fiber <- {
        val action = for {
          // take half of the queue's contents
          front <- 0.until(50).toVector.traverse(_ => q.take)
          _ <- backR.set(front)

          // release the canceler to race with us
          _ <- latch.complete(())

          // take the other half of the contents with atomic writing
          _ <- 50.until(100).toVector traverse { i =>
            IO uncancelable { poll =>
              // if data is lost, it would need to manifest here
              // specifically, take would claim a value but flatMap wouldn't run
              poll(q.take).flatMap(a => backR.update(_ :+ a))
            }
          }
        } yield ()

        action.start
      }

      _ <- latch.get
      _ <- fiber.cancel

      // grab whatever is left in the queue
      remainderOpt <- q.tryTakeN(None)
      remainderVec = remainderOpt.getOrElse(Nil).toVector
      _ <- backR.update(_ ++ remainderVec)

      // if we lost data, we'll be missing a value in backR
      results <- backR.get
      _ <- IO(results must contain(0.until(100)))
    } yield ()
  }

  test.as(ok)
}

@durban
Contributor

durban commented Apr 1, 2022

Yeah, okay, I believe my pseudocode is the same as the important part of that test. And I can confirm that an item can become "lost" that way.

I think this is a really difficult problem. I'll try to describe it with an example: q is an initially empty queue, and we have 3 fibers:

  • fiber A executes the poll(q.take).flatTap { save it somewhere... } part
  • fiber B executes q.offer(x)
  • and fiber C cancels fiber A (at the worst possible time)

I think what could happen (and what causes the item to be lost) is this:

  1. A inserts its deferred into the queue state, and becomes suspended waiting on that deferred. A is cancellable at this point (it's inside the poll).
  2. B atomically removes the deferred from the queue state. (After this event, take logically removed the item x from the queue.)
  3. B completes the deferred.
  4. A is woken up by the deferred, and continues its execution.
  5. A encounters the UnmaskK instruction: this makes A uncancelable.
  6. A continues with the flatTap.

After 2., the item x is irrevocably removed from the queue. But it's only after 5. that fiber A becomes uncancelable. So if C cancels it between 2. and 5., the item is effectively "lost".
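
A minimal, hypothetical rendering of that A/B/C interleaving (it won't hit the race deterministically, it just maps the steps above onto code; the 10ms sleep is an arbitrary way to let A register its taker and suspend):

import cats.effect.IO
import cats.effect.std.Queue
import cats.syntax.all._
import scala.concurrent.duration._

val repro: IO[(Option[Int], Option[Int])] =
  for {
    q     <- Queue.bounded[IO, Int](1)
    saved <- IO.ref(Option.empty[Int])
    // fiber A: the poll(q.take).flatTap { save it somewhere } pattern from above
    a     <- IO.uncancelable(poll => poll(q.take).flatTap(i => saved.set(Some(i)))).start
    _     <- IO.sleep(10.millis)  // let A register its taker and suspend
    _     <- q.offer(42)          // fiber B's offer (run here on the main fiber)
    _     <- a.cancel             // fiber C: cancel A, hopefully between steps 2 and 5
    s     <- saved.get
    left  <- q.tryTake
    // if both s and left are None, the element 42 was lost
  } yield (s, left)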

@djspiewak
Member Author

Okay, philosophy time. Proactively paging @SystemFw for rebuttal.

In an MPMC queue, ordering and fairness are not properties which are strongly defined. Ordering with respect to what? If you have two producers which generate values A and B close together in time, then it's really unclear which one is actually "first". Of course, the queue data structure forces us to pick a winner here, so we say that one of them is "first", but is it really? It is not. And note that this applies even with MPSC.

The same argument can be made with SPMC. A single producer generates A and B while two consumers are waiting. Who's to say which consumer is first to the value? When parallel events are close together in time, ordering is not even a meaningful thing to discuss. Note that the pathologies with SPMC only occur when multiple events are being produced close together in time; singular events do not have this issue.

Taking a step even further back, all of this stuff is based on compareAndSet, which does not guarantee ordering! Ordering (and therefore fairness) is not an absolute, but more of a stochastic weighting and a spectrum. So long as we're mostly first-in, first-out with escalating Heisenberg uncertainty as events get closer together in time, we're still preserving those properties.

So having established that concurrent ordering isn't an absolute, let's look at the broader picture a bit. "At least once" semantics for Queue really aren't very useful because users cannot respond. In particular, if take claims an element from the data structure and then is canceled before control flow has a chance to reenter user space, the results will be unavoidable data loss. Now this is fine if we're in a single consumer scenario, since then the queue is shutting down and it probably doesn't matter, but precisely because we allow multiple consumers, this is just untenable. You can only resolve this right now by having a secondary queue which ACKs back to the producer(s), and then you're literally just back in networking land. That really isn't a good option.

Conversely, if we strengthen Queue to be "exactly once", the property we must sacrifice is deterministic ordering on closely-sequenced events. This is fine though, because as we previously established, this ordering is already an illusion! And even if it weren't an illusion, it is possible for users to respond to ordering issues by imposing a monotonic sequencing protocol on the events, and this is a much simpler lift than a two-phase commit ACK protocol.
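
As a rough, hypothetical illustration of such a sequencing protocol (not part of the Queue API; the wrapper and names are made up): producers can tag each element with a monotonically increasing index drawn from a shared Ref, so consumers can observe or restore relative production order downstream.

import cats.effect.{IO, Ref}
import cats.effect.std.Queue

// Hypothetical helper: every offered element carries a monotonically increasing sequence number
def sequencedOffer[A](counter: Ref[IO, Long], q: Queue[IO, (Long, A)])(a: A): IO[Unit] =
  counter.getAndUpdate(_ + 1).flatMap(n => q.offer((n, a)))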

I'm pretty convinced that the only useful semantic here is "exactly once" and we really have to maintain it, even when it means corrupting fairness or ordering, so long as fairness and ordering are preserved in sparser temporal sequencing cases and relative partial ordering within a consumer is guaranteed. Both of those properties are things that we can do.


Speaking more concretely… We can achieve these properties by changing the takers and offerers to be a SQueue[Deferred[F, Unit]] instead of the current Deferred[F, A]. This is very similar to how #2914 (and friends) behave already.
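
To make that concrete, here is a hedged sketch of what a take with signal-only waiters could look like. It is not the eventual implementation, just an illustration under assumed names (State, buffer, takers): the Deferred carries no element, and the element itself is only ever claimed inside the uncancelable region, so a canceled waiter cannot lose data.

import cats.effect.{Concurrent, Deferred, Ref}
import cats.effect.syntax.all._
import cats.syntax.all._
import scala.collection.immutable.{Queue => SQueue}

// Illustrative state: buffered elements plus waiting takers as bare wakeup signals
final case class State[F[_], A](buffer: SQueue[A], takers: SQueue[Deferred[F, Unit]])

def take[F[_], A](state: Ref[F, State[F, A]])(implicit F: Concurrent[F]): F[A] =
  F.uncancelable { poll =>
    F.deferred[Unit].flatMap { signal =>
      val cleanup = state.update(s => s.copy(takers = s.takers.filterNot(_ eq signal)))

      // either claim an element right away, or register the signal and report "not yet"
      val claimOrRegister: F[Option[A]] =
        state.modify {
          case State(buffer, takers) if buffer.nonEmpty =>
            val (a, rest) = buffer.dequeue
            (State(rest, takers), Some(a))
          case State(buffer, takers) =>
            (State(buffer, takers.enqueue(signal)), None)
        }

      claimOrRegister.flatMap {
        case Some(a) => F.pure(a) // claimed while uncancelable: cannot be lost
        case None =>
          // only the wait is cancelable; since the signal carries no element, cancelation
          // here leaves the element in the buffer for someone else. After waking up, retry.
          poll(signal.get).onCancel(cleanup) >> poll(take(state))
      }
    }
  }

The wake-then-retry loop is where the fairness tradeoff discussed below comes from: a freshly arriving taker can claim the buffered element before the woken waiter retries.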

@SystemFw
Contributor

SystemFw commented Apr 1, 2022

Proactively paging @SystemFw for rebuttal

I've been thinking about this for some days actually, and I'm almost convinced :) , especially if the argument is made in general, and not just for the purpose of implementing Channel on top of Queue (which is of course nice to have as well). In particular

really aren't very useful because users cannot respond

is a very strong argument, and in some ways similar to the argument I made for poll in the Semaphore + Queue scenario.

We can achieve these properties by changing the takers and offerers to be a SQueue[Deferred[F, Unit]]

I'm also generally a fan of this strategy; I used it for Channel and various other things, and went so far as to propose it as a general approach (link)


With that being said though, I want the tradeoff we're making to be absolutely clear, because I think the argument somewhat brushes over it:

A single producer generates A and B while two consumers are waiting. Who's to say which consumer is first to the value? When parallel events are close together in time, ordering is not even a meaningful thing to discuss.

I think ordering comes into play in two ways: 1) there is no guarantee of ordering, but there is close to infinite probability that there is indeed an order, and 2) there is a guaranteed happens-before relationship. If you have neither, then you can say "order is not meaningful"; if you do, I feel the argument is weaker.
In particular:

Queue.unbounded[IO, Int].flatMap { q =>
  q.take.start >>
  IO.sleep(2.days) >>
  q.take.start >>
  IO.sleep(2.days) >>
  q.offer(1) >>
  q.offer(2)
}

now, if we describe ordering as "in which order should elements be processed", then I agree that the very notion of *PMC queues is against it: because of concurrent consumers, you've already relinquished order of processing. If we instead define ordering as "who should get which element" or indeed "which producer is first to the value", then I think in that example it's pretty obvious what the ordering should be: the consumers fall into the "infinite probability" case, and producer falls into the happens-before case. The proposed strategy breaks this latter notion of ordering if you are unlucky with your races (read this for a worked out example)

If everyone is on board with this limitation, I'll be on board as well, since I can see the general argument for it.

@armanbilge
Member

Question: is there no relevant prior art on this issue? If not, I assume it's because the CE cancellation model is so original? Like, what does everyone else do 😅

@djspiewak
Member Author

now, if we describe ordering as "in which order should elements be processed", then I agree that the very notion of *PMC queues is against it: because of concurrent consumers, you've already relinquished order of processing. If we instead define ordering as "who should get which element" or indeed "which producer is first to the value", then I think in that example it's pretty obvious what the ordering should be: the consumers fall into the "infinite probability" case, and producer falls into the happens-before case. The proposed strategy breaks this latter notion of ordering (typelevel/fs2#2856) if you are unlucky with your races (read this for a worked out example)

This is a great summary, and it goes directly to where I think the distinction in definitions lies. I would define ordering as the following pair of things:

  • The order in which elements are processed relative to the order in which they are produced
  • The order in which consumers are notified relative to the order in which they are registered

These are both, fundamentally, the same problem. In a sense, the registration of consumers is effectively like producing a listener message back to the producers, in the same way that the production of elements is producing an element forward to the consumers. Thus, I believe the guarantees on both should be symmetrical. In an MP queue scenario, ordering on element production is degraded. In an MC queue scenario, ordering on consumer registration is degraded. (Also, I say "degraded" and not "lost" because, as you noted, the probability of strong ordering asymptotically approaches certainty as the relative temporal delta increases.)

So to that end, the limitation here is that, in any MC scenario, we cannot guarantee strong wake-up ordering between consumers in closely interspersed events, just as in any MP scenario we cannot guarantee strong ordering between elements in closely interspersed events. I believe that limitation is acceptable.

@durban
Contributor

durban commented Apr 3, 2022

Regarding decreasing fairness: in my opinion, it's probably fine. As far as I understand #2885, actual FIFO semantics are not broken. Intuitive "fairness" is decreased (hopefully only slightly).

FIFO semantics are not broken, because registration is unobservable in the current queue API. So in this example:

q.take.start >>
IO.sleep(x) >>
q.take.start

The two takes are actually racing. There is no ordering between the 2 registrations, as registrations are not observable. (Sure, as x grows larger, there is a higher probability that the take in the 1st line is registered before the one in the 3rd line.) So I agree that it's probably fine.

An idea: to (probably) further decrease this fairness, and (maybe) increase performance, the queue in #2885 could use concurrent bags instead of queues for the waiters (but not for the buffer). But it may not be worth it.
