
Agent spec must address agent failure (was: Memory model must include partial failure) #55

Closed
erights opened this issue Jan 29, 2016 · 26 comments

@erights

erights commented Jan 29, 2016

Presumably, some expected implementation of multiple vat/agents communicating through a shared arraybuffer will be in terms of multiple processes communicating through a shared arraybuffer. Processes are also a unit of separate failure, so a process holding locks may preemptively disappear.

Since the memory those locks were guarding must be assumed to be in an inconsistent state, those locks cannot be released. This can block counterparty processes indefinitely. These issues should be made explicit and discussed.

@lars-t-hansen
Collaborator

Section 8.1 "Agent mapping" in the spec discusses the issue to some extent. Partly it does this by suggesting that we not allow process separation (obviously debatable, and it should be debated). It also notes that even without process separation, in-process workers can be killed off, and that this will leave memory in an inconsistent state, as you suggest.

Adding a notion of partial failure elsewhere might cover this nicely. (I'm not sure yet that it needs to be in the memory model, although I see the arguments why it should be.)

@lars-t-hansen
Collaborator

Reacting only to the aspect about whether this needs to be in the "memory model", I think the memory model already has what it needs (not to say that that's all that's needed in the spec text).

We don't have a primitive notion of locks, only of atomics and waits, and those are orthogonal ideas, so there's no way to say that an agent termination event should cause waiters on the terminated agent's owned resources to somehow be signaled. The "owned resources" in this case are only a shared cell that the lock holder will signal on with futexWake(), a fact known only to the agents agreeing to use that cell as a signaling location. As the world stands today, if an agent that holds a "lock" is killed the agents that wait for that lock will wait forever.
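A minimal sketch of the kind of ad hoc "lock" under discussion, using the futexWait/futexWake primitives from this proposal (the cell layout and constants are illustrative assumptions):

    // One Int32 cell in shared memory acts as the lock: 0 = free, 1 = held.
    const UNLOCKED = 0, LOCKED = 1;

    function lock(i32a, idx) {
      while (Atomics.compareExchange(i32a, idx, UNLOCKED, LOCKED) !== UNLOCKED) {
        // Sleep until some agent futexWakes this cell.  If the agent holding
        // the "lock" is killed here, this wait never returns.
        Atomics.futexWait(i32a, idx, LOCKED);
      }
    }

    function unlock(i32a, idx) {
      Atomics.store(i32a, idx, UNLOCKED);
      Atomics.futexWake(i32a, idx, 1);  // wake one waiter, if any
    }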

That is exacerbated by bugs / missing features in the Workers spec, which I have outlined here: https://www.w3.org/Bugs/Public/show_bug.cgi?id=29039. One of these bugs is an inability to observe directly whether a worker is still alive. (One can imagine a clever heartbeat system or use of timeouts but it sounds brittle.)

If an agent is killed while holding a lock there are two (application-specific) possibilities: One, the system is hopelessly compromised and we must shut it down. Two, the lock can be safely broken and the system can continue.

In the second case, there is already the necessary happens-before relationship between the dead agent's writes to the protected variables and the observation in some other agent that the first agent has died, via point 4 of the happens-before relationship and the browser extensions to happens-before. The observer effectively inherits the lock in a race-free manner and can perform cleanup and unlock.

(The observer must still observe that that lock is held by the dead agent, which requires some kind of recording mechanism in the lock, but that's a separate matter.)
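A sketch of what that recording mechanism and race-free lock breaking might look like (agent ids and the means of observing the death are assumptions; the embedding would have to supply the latter):

    // The lock word records its owner: 0 = free, otherwise the holder's
    // nonzero agent id.  An observer that has learned (by some
    // embedding-specific means) that agent `deadId` died may break the lock.
    function tryBreakLock(i32a, idx, deadId) {
      // The observation of the death happens-before this read (per the text
      // above), so reading the dead agent's id here is race-free.
      if (Atomics.load(i32a, idx) === deadId) {
        // ... application-specific cleanup of the protected data here ...
        Atomics.store(i32a, idx, 0);      // release the broken lock
        Atomics.futexWake(i32a, idx, 1);  // let one waiter proceed
      }
    }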

@erights
Author

erights commented Feb 27, 2016

I notice in the spec that SABs, unlike ABs, cannot be detached. I can imagine several kinds of termination relationship that an agent may wish to have to an SAB, either in general or possibly during sensitive periods:

  • Client-protecting detachment: If I preemptively disappear (possibly during sensitive periods), then I want this SAB as held by others to become detached, letting them know that I am dead and protecting them from further access to this data.
  • Self-protecting detachment: If any of the agents sharing this SAB preemptively disappears, then I want my access to this SAB to become detached, letting me know that one of them is dead, and protecting me from confusing myself.
  • Fail fast: If my access to this SAB disappears, I should be preemptively terminated rather than observe the SAB in a detached state. (Erlang-on-JS may opt for this.)

The first two bullets raise the possibility of an SAB becoming detached during a futexWait. The third bullet aggregates agents into fail-together groups, where these groups might grow with further SAB sharing.

@lukewagner

The first two bullets would, I believe, have pretty major implementation/performance/security hazards. They mean that, between any two instructions (in JIT code, in C++ code iterating over a typed array view, etc.), a SAB could have become detached. Just the possibility of an AB being detached synchronously by reentering JS that postMessage()d caused a number of browser security bugs a few years ago. It seems implementations would be forced into various complicated schemes to avoid hurting performance too much, and this would require audits of all AB/TA-accessing code to detect this new kind of racy detachment.

The last bullet seems possible to implement by reusing the hung-script-termination mechanism. However, since that path is expected to be rare/catastrophic, it can be pretty slow and expensive. Also, as I understand it, some engines rely on simply killing processes to kill hung scripts and so don't support the granularity necessary to precisely terminate a fail-together group.

@erights
Author

erights commented Mar 2, 2016

Without a mechanism like this, I still do not understand how SAB proposes to contain inconsistency following the preemptive termination of one of the participants. Please clarify. Thanks.

@lukewagner

I should make sure I understand what "preemptive termination" means: are we only talking about events that come from "outside" of an app, like the slow-script killer and OS OOM killer (so specifically not worker.terminate())? In that case, it probably already makes sense, even before SAB (since you don't need shmem to get into a globally inconsistent state when one of your workers is preemptively terminated), to preemptively kill "everything reachable" (the fail-together group). For worker.terminate(), though, since the application requested the termination, it could still be in a consistent state (if proper synchronization was used), and any additional preemptive termination would be unwelcome. (This is symmetric to the situation with TerminateThread/pthread_kill, except that they're even more dangerous.)

@erights
Author

erights commented Mar 2, 2016

@lukewagner Where is the discussion of the before-SAB fails-together group? What definition of "reachable" does it use when it talks about "everything reachable"?

@lars-t-hansen
Collaborator

Mark, in your introductory comment you bring up the issue of "locks" that are held by the crashed agent, but there are no locks in this proposal (as I wrote in an earlier comment). It is true that an agent in the middle of what it perceives to be a critical section will use non-synchronized stores on locations that, if accessed by another agent after the first agent has crashed, would cause a race condition. But in the general situation the accessing agent would have to observe the termination of the former agent first, and that observation provides the necessary edge in the happens-before relation to avoid the race. But I suspect that's not what you're getting at.

Can we perhaps break this down into various concrete cases in the terms of the proposal?

For example, suppose an agent crashes while it is inside a critical section and thus fails to reset a flag on which the other workers are futexWaiting. The other agents are now hung, and there are no facilities built into the system to unhang them. Either the system deadlocks (or is terminated altogether by unspecified means), or there is some worker (maybe the main thread) that can observe the crashed agent. The observer now has a choice: unblock those other agents or not? If it does, they may go into the critical section, which could be bad if we can't clean up the state properly. But is this your concern? Then perhaps the "locks" that are built around futexWait need to account for the need to terminate?

Please post some concrete scenarios that we can talk about.

@erights
Author

erights commented Mar 2, 2016

Actually @lars-t-hansen that breaks it down rather well. Thanks.

Then perhaps the "locks" that are built around futexWait need to account for the need to terminate?

What would this look like? You have a good library of example higher-level abstractions built on futex (and presumably buildable on synchronic). Could all of these practically be modified to wake on termination of a counterparty without indicating that it is safe to proceed to access memory that is not actually in a consistent state? All those that time out already have a way to wake their thread without falsely indicating safety, so can we reduce this to that previously solved problem? How would they even notice the termination of another participant? Doesn't sensing another's termination already require some additional mechanism? A sketch of the reduction appears below.
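To make the reduction concrete, here is a sketch of an acquire that reports why it returned instead of pretending the lock was obtained (the dead-flag cell, set by whatever termination-detection mechanism exists, is hypothetical):

    // Returns "acquired" only when it is safe to touch the protected data.
    function acquire(i32a, lockIdx, deadIdx, msPerTry) {
      for (;;) {
        if (Atomics.compareExchange(i32a, lockIdx, 0, 1) === 0)
          return "acquired";
        if (Atomics.load(i32a, deadIdx) !== 0)
          return "counterparty-dead";      // do NOT touch the protected data
        Atomics.futexWait(i32a, lockIdx, 1, msPerTry);  // wakes on unlock or timeout
      }
    }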

If "accounting for the need to terminate" is generally practical, and we genuinely expect these higher level abstractions to do so, then that is probably sufficient for my concern.

@lukewagner

Are all these references to "terminate" referring to external (browser) causes (hung-script killer) or to the app itself via worker.terminate()? It seems to me this makes a big difference: in the latter case, the app can design for and handle this intentional termination; the former case seems very hard to recover from gracefully, in general (again, even without SAB).

@lars-t-hansen
Collaborator

There is a known (to me, anyhow :) missing piece of the spec, which has to do with inter-thread signaling. I've held off on proposing anything because I've assumed that it would be better to build that mechanism on top of the more primitive ones we provide. It should not be difficult to create a lock whose acquire function checks, once it's inside the critical section, whether the lock state (or some global flag) says to unlock and throw. Is that sufficient for what we need? I'm not sure, it's probably application-dependent, but it is how our pthreads implementation handles asynchronous signals, see e.g. https://github.com/kripken/emscripten/blob/07b87426f898d6e9c677db291d9088c839197291/system/lib/libc/musl/src/thread/__wait.c.
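A sketch of such a lock (the cell layout and the "cancelled" flag are assumptions; how the flag gets set is a separate question):

    // acquire enters the critical section, then consults a shared
    // "cancelled" cell; if it is set, release the lock and throw rather
    // than letting the caller run on possibly inconsistent data.
    function acquireOrThrow(i32a, lockIdx, cancelIdx) {
      while (Atomics.compareExchange(i32a, lockIdx, 0, 1) !== 0)
        Atomics.futexWait(i32a, lockIdx, 1);
      if (Atomics.load(i32a, cancelIdx) !== 0) {
        Atomics.store(i32a, lockIdx, 0);
        Atomics.futexWake(i32a, lockIdx, 1);
        throw new Error("lock cancelled: a cooperating agent has terminated");
      }
    }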

(On that general note, the more I use these mechanisms the more I tend to construct elaborate functionality around them that is easier to use than the raw mechanisms.)

@lukewagner, initially Mark brought up process separation, ie, it could be the OS doing the killing.

(Also see my ongoing crusade to get some kind of worker state introspection into html.)

@erights
Author

erights commented Mar 2, 2016

(Also see my ongoing crusade to get some kind of worker state introspection into html.)

Also see where?

@lars-t-hansen
Collaborator

See my earlier comment on this bug, notably the link therein to the w3c bug tracker.

Somebody suggested to me that the unavailable introspection APIs were a result of resistance to GC observability, which is somewhat plausible, but without a way to know what's going on with workers -- whether they were created OK, whether they have terminated -- using them for serious work will be hard.

@lukewagner

Agreed. I was lumping "browser hung-script killer" and "OS killing" into the same bucket above as external sources that terminate your workers (as opposed to worker.terminate()), saying that if the browser/OS kills one of your workers, the browser should conservatively kill the whole app (handwaving at "app"). But maybe that's too extreme, since the rest of the app might otherwise be able to recover (ignoring the dead SAB); so I now understand the use case above: how do we allow this group to recover at all? I still think racy detachment is super-hazardous for the entire browser, so I hope a different strategy can be found that is confined to a few APIs.

Where is the discussion of the before-SAB fails-together group?

I've seen concepts like "constellation" mentioned before in the context of asking which windows to stop when the user stops a slow script (e.g.), and I think different browsers do quite different things here, so maybe the definition established for SAB of a fails-together group would make sense as an answer to this stop-script question.

@lars-t-hansen lars-t-hansen changed the title Memory model must include partial failure Agent spec must address agent failure (was: Memory model must include partial failure) Mar 3, 2016
@lars-t-hansen
Collaborator

I changed the title of this bug because I feel like we're iterating to a point where there's agreement(?) that the memory model per se probably has the necessary machinery to deal with agent failure but the agents spec needs to address the consequences of agents being terminated suddenly by external forces. That is section 1.7 of the agents spec, which so far just reads "TODO".

@lars-t-hansen
Collaborator

In Issue #27 @annevk posted a link to an illuminating W3C discussion thread and a more recent bug report, in both cases pertaining to the difficulty of signaling agent termination properly, as well as how that interacts with GC. The focus is on message ports, which are more general than shared memory communication, but it really comes down to how an agent learns about the failure of another so that it can take action.

In that context, even defining "termination" is tricky. It might be the OS killing a process containing a tab, but it could equally well be a user navigating away from a tab that owns one endpoint of a message channel, where the tab at the other end will be left waiting for a message that never comes because the sender disappeared. In the latter case it is in principle possible to catch some onunload event and send a message from the expiring tab, but Smart People describe this as Very Tricky, and a system-provided solution is called for, but none has been found so far as I can tell.

I don't think my ambition extends to solving this problem on behalf of W3C.

At the moment, the agents spec says that if an agent in a cluster is suspended by the embedding then the entire cluster must be suspended. This pretty much confines an agent cluster to a tab plus its dedicated workers. But they could all be in independent processes, and even if not, they could be in independently accounted heaps with an in-process OOM killer taking one of them; so termination is a different kind of thing than suspension.

Really the needs in the case of shared memory are exactly the same as the needs of non-shared memory, as discussed in the thread referenced above: if a worker is nuked, somebody needs to be told, or the program will become unreliable. The shared memory spec already has language that establishes happens-before if an observation is made that an agent in the cluster has terminated, and that is probably all it needs to say. Whether notification is made is the concern of the embedding.

EDIT: It is the shared memory spec that establishes the happens-before relation, not the agents spec (last paragraph).

@lars-t-hansen
Collaborator

Prose added to the section on agent clusters:

    <p> An embedding may terminate an agent without any
      of the agent's cluster's other agents' prior knowledge or cooperation.  If the
      embedding terminates an agent it must make it possible for any other agents
      in the terminated agent's cluster that continue to run to discover the
      termination. </p>

    <emu-note>
      <p> Examples of that type of termination are: operating systems
        or users terminating agents that are running in separate
        processes; the embedding itself terminating an agent that is
        running in-process with the other agents when per-agent
        resource accounting indicates that the agent is runaway. </p>
    </emu-note>

    <emu-note>
      <p> (Spec draft note) The requirement that the embedding must
        make the termination known to other agents is actually very
        soft.  The shared memory spec will require that detecting
        termination creates the necessary happens-before edge in the
        memory ordering, which is a tougher requirement.  The web
        platform provides nothing at the moment to detect
        termination. </p>

      <p> In current browsers, given that dedicated workers are
        in-process, an agent cluster will normally terminate en masse,
        so the requirement is trivially satisfied for now. </p>
    </emu-note>

@domenic @erights @annevk - any remarks?

@domenic
Member

domenic commented Mar 10, 2016

If the embedding terminates an agent it must make it possible for any other agents in the terminated agent's cluster that continue to run to discover the termination.

This really threw me off, until I saw the spec draft note. I think this should be rephrased to something like "The SharedArrayBuffer type provides the ability for any other agents in the agent's cluster that continue to run to discover the termination", perhaps with some example code.
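For instance, a sketch of what such example code might look like (the agreed-upon status cell and the recoverFrom helper are assumptions, not anything the spec defines):

    // The agents agree that status[0] holds the id of a terminated agent
    // (0 meaning "all alive"), set via whatever discovery mechanism the
    // embedding provides.  recoverFrom is hypothetical application cleanup.
    const status = new Int32Array(sharedStatusBuffer);  // an agreed-upon SAB

    function checkPeers() {
      const dead = Atomics.load(status, 0);
      if (dead !== 0) {
        // Observing the termination happens-before this point (per the
        // shared memory spec), so touching the dead agent's data is race-free.
        recoverFrom(dead);
      }
    }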

In general imposing requirements like this on the host environment's APIs seems way out of scope for ES.

@erights
Author

erights commented Mar 10, 2016

In general imposing requirements like this on the host environment's APIs seems way out of scope for ES.

I don't think it is out of scope, but neither do I think it should be treated as a host-specific issue. We should address the means of failure containment here. The issue is, what is the minimal contagion unit of preemptive failure. Naively, we could make the following analogy:

Agent == thread
Agent Cluster == process

where we assume that processes do not directly share memory with each other. If we buy this analogy, then the conclusion is that the agent cluster is the fail together unit. If we couple agents within a cluster through SABs such that one agent preemptively failing must be assumed to leave the others in an inconsistent state, then this unpleasant conclusion would be correct.

Erlang shows the great benefit of programming with small units of failure. Erlang does it by "shared nothing" which, by definition, does not apply to agents sharing an SAB.

However, coupling disjoint heaps through SABs is still much looser coupling than, say, the shared memory multithreading between Java threads, which share the whole heap. Java has no choice but to have the entire process fail together. (The deprecation of Thread.stop http://docs.oracle.com/javase/6/docs/technotes/guides/concurrency/threadPrimitiveDeprecation.html was an admission that even sudden program-requested preemptive termination could never be made to work for less than a process.) Because our agents have disjoint heaps, we have a much better chance of not requiring a cluster to be killed if any of its agents is killed. But only if we address this issue.

There is nothing browser-specific about this issue. It will arise in any platform supporting multiple agents, SAB, and wanting to contain failures to smaller units.

@lars-t-hansen
Collaborator

In general imposing requirements like this on the host environment's APIs seems way out of scope for ES.

I was extremely hesitant to introduce it, but I do note that the previous normative paragraph imposes just such a requirement when suspending an agent. (That's the paragraph you described as "very interesting" in Issue #27, perhaps I did not fully grok the meaning of your comment :)

@lars-t-hansen
Collaborator

A few more things.

The SharedArrayBuffer spec does not actually provide any kind of termination-detection mechanism. Technically, it doesn't even require one to exist. It requires that if one exists then it exists within the embedding, so that there is clear causality between (agent A terminates) and (agent B discovers that agent A has terminated), allowing B to avoid all races on any locations that A wrote before it terminated.

The current HTML spec does not address failure containment, yet "failure" happens even so (see the mail thread I cited earlier) and is a problem in practice. That has nothing to do with shared memory, and in fact the problems people are citing are much worse than problems we might have within an agent cluster, which is tightly coupled.

I find myself on the fence here. On the one hand I think that the embedder needs to provide a mechanism to signal termination, hence the draft language. On the other hand I can't really enforce that, no matter what I do, unless I spec a concrete API for it in JS. Assuming that act would even pass muster, in defining an API I almost certainly have to talk about representations of agents, which I don't want to do, because I don't want to mandate them. I can't even mandate that there's an API for creating or naming agents; there could be a command line switch to create n of them, with a predefined SAB available to all of them and no way to share any more SABs. I can maybe mandate that there is a signal, but I can't mandate that that signal carries more than one bit ("somebody crashed") without getting into embedding specifics.

Because our agents have disjoint heaps, we have a much better chance of not requiring a cluster to be killed if any of its agents is killed. But only if we address this issue.

I agree with that.

There is nothing browser-specific about this issue. It will arise in any platform supporting multiple agents, SAB, and wanting to contain failures to smaller units.

I agree with that also, but I'm starting to believe that all solutions are embedding-specific.

@lars-t-hansen
Collaborator

Maybe this all leads to something like this:

"If an agent is terminated not by programmatic action of its own or of another agent in the cluster but by forces external to the cluster, then the embedding has two choices: Either terminate all the agents in the cluster, or provide reliable APIs that allow the agents in the cluster to coordinate so that at least one remaining member of the cluster will be able to detect the termination, with the termination data containing enough information to identify the agent that was terminated."

(Even that's pretty high-level still but it is better than the one-bit signal. And the requirement of causality in the shared memory spec comes in addition.)

In the context of HTML that could be satisfied by the suggested "channeldropped" event on a message channel, modulo whether that suggestion is well-defined.

It could also be satisfied by lighter-weight mechanisms that are new. For example, passing a SAB + a location within the SAB to the Worker constructor with the intent that the embedding sets the byte at that location if the worker is terminated would be adequate, if clunky to use. One is reminded of wait(). In the context of the dedicatedThread proposal these parameters would be passed as additional named parameters (waitBuffer, waitLoc).
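A sketch of that mechanism (nothing here is a shipping API: waitBuffer/waitLoc are the proposed parameters, and we assume the embedding both sets the cell and futexWakes waiters when the worker dies):

    const sab = new SharedArrayBuffer(4);
    const deathFlag = new Int32Array(sab);

    // Proposed: the embedding writes a nonzero value to deathFlag[0] (and
    // wakes waiters) if this worker is ever terminated.
    const w = new Worker("child.js", { waitBuffer: sab, waitLoc: 0 });

    // In some monitoring agent that also has sab:
    Atomics.futexWait(deathFlag, 0, 0);   // sleeps while the cell is still 0
    if (Atomics.load(deathFlag, 0) !== 0) {
      // The worker was terminated; break its locks and clean up here.
    }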

Until such mechanisms exist browsers would just terminate the cluster when an agent is terminated, which is what they do now.

The word "reliable" in the paragraph above means that it is not sufficient to rely on the new agent to perform any action to set this up, as the agent could crash on the first instruction executed.

@lars-t-hansen
Collaborator

Added the proposed revision (previous comment) to the proposal.

@lars-t-hansen
Collaborator

@erights, @domenic, any reactions to the language quoted above, which I also added to the spec? In my mind it strikes a balance: it provides some guarantees (about behavior) that I think we should be providing, while leaving space for embeddings to do different things; in particular, things would "just work" in current browsers.

(Maybe something to discuss at the meeting later this month.)

@lars-t-hansen
Collaborator

I believe the current spec adequately addresses this issue.

@quirinpa

I'm trying to port Linux and I'm still single threaded.
So the program goes SYS_Exit, the kernel handles it, and says hugs, we're mid-stack.
The developer commits what will never merge // for (;;)
Unreachable - gotta catch them all.
qnixsoft.com/cgit
