Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Notifier observers to discriminate between vat failure and upgrade #5185

Closed
mhofman opened this issue Apr 21, 2022 · 43 comments · Fixed by #7401
Closed

Notifier observers to discriminate between vat failure and upgrade #5185

mhofman opened this issue Apr 21, 2022 · 43 comments · Fixed by #7401
Assignees
Labels
enhancement New feature or request notifier SwingSet package: SwingSet vaults_triage DO NOT USE

Comments

@mhofman
Copy link
Member

mhofman commented Apr 21, 2022

What is the Problem Being Solved?

We need a mechanism for notifier observers to discern a failure caused by a vat termination (which should result in a permanent failure), and a failure caused by a vat upgrade (which should result in a retry).

By happen-chance right now we're been using an error instance for the former, and an error string for the latter. This is however very brittle, and a drive-by change (almost?) made the string an error: https://github.com/Agoric/agoric-sdk/pull/5130/files#r854732408

Description of the Design

Some more stable and documented way to discern the cause of failure, and whether a retry should be attempted.

Security Considerations

Test Plan

There must be tests and documentation for this behavior.

@mhofman mhofman added the enhancement New feature or request label Apr 21, 2022
@warner warner added SwingSet package: SwingSet notifier labels Apr 21, 2022
@warner
Copy link
Member

warner commented Apr 21, 2022

This overlaps with SwingSet because that's where vat failure-vs-upgrade originates. But the protocol definition and implementation is squarely in packages/notifier. We'll change swingset to emit whatever seems most useful for Notifier.

@warner
Copy link
Member

warner commented Apr 21, 2022

What I just heard from @FUDCo is that @erights agreed to use reject('vat upgraded') as the very specific signal that the notifier's observe should re-try the getUpdateSince, and that any other rejection should be considered a non-retryable failure. This means that the specific vat upgraded string is now part of the swingset API.

@mhofman
Copy link
Member Author

mhofman commented Apr 21, 2022

I'm ok with that as long as it's documented somewhere, preferably including at the place where the string is thrown.

As I mentioned in the PR, I believe we should have a single const where the special string is defined, with a comment explaining why it's so, and that it must not change without an update to code relying on notifiers.

@mhofman
Copy link
Member Author

mhofman commented Apr 21, 2022

I also would prefer if notifiers used a helper like isUpgradeError so that we don't need to track down everywhere such test may happen.

@FUDCo
Copy link
Contributor

FUDCo commented Apr 21, 2022

I also would prefer if notifiers used a helper like isUpgradeError so that we don't need to track down everywhere such test may happen.

Something like that would be good, but not with that name. As @erights points out, the reason for not using an Error is that it's not an error.

@FUDCo
Copy link
Contributor

FUDCo commented Apr 21, 2022

Also, if we had used Error, you'd still have the same issue because you'd need to look at the message inside the error to see if the rejection was caused by an upgrade.

@warner
Copy link
Member

warner commented Apr 27, 2022

@gibson042 to modify the two client-side helper functions in packages/notifier to understand this protocol ("rejected with vat upgraded").

@dtribble says: if the first request to this notifier rejects with the "vat upgraded", that's probably a deeper error, and we should stop. @erights says what? @dtribble says: if you didn't get the promise yourself...

  • I ask for a notifier
  • I get a promise that (hopefully) resolves to the notifier
  • I hand that promise to the iterator

But: if the promise actually gets rejected (with vat upgraded), that will cause a loop. So we need to distinguish between whether we successfully got a notifier or not.

@erights says: rejected promise contagion can confuse. Maybe getting a rejection reason through contagion is an error.

@erights
Copy link
Member

erights commented Apr 27, 2022

The rest of the post below is of historical interest only. While writing it up, I realized that it is fatally flawed, which I'll explain in the next message.

I suggest that the normal promise rejection contagion guarantee that the propagated rejection reason always be an error independent of whether the original rejection reason is an error.

consider:

const notifierP = //...whatever returns an unresolved remote promise
const recordP = E(notifierP).getUpdateSince(....);

If recordP rejects with the string 'vat upgraded', we want to treat that as a non-erroneous signal that the notifier client should adapt to the notifier's vat being upgraded. However, if notifierP rejects with the string 'vat upgraded', then the normal rejection reason contagion would also reject recordP with the string 'vat upgraded'. However, the state of recordP should actually be diagnostic of something erroneous happening. And the purpose of rejection reason contagion is to propagate error diagnostic.

So, I suggest the correct solution is at a low abstraction level: That the rejection reason contagion mechanism should guarantee that the derived rejection always be an error, because any derived rejection itself should be considered erroneous.

The rest of the post above is of historical interest only. While writing it up, I realized that it is fatally flawed, which I'll explain in the next message.

@erights
Copy link
Member

erights commented Apr 27, 2022

The fatal problem is that blind rejection reason contagion is too deeply a part of existing standard JavaScript semantics that we absolutely cannot change. Consider .then, await, and Promise.all. These propagate whatever the rejection reason is, and our E semantics has to work intimately and consistently with it.

@mhofman
Copy link
Member Author

mhofman commented Apr 27, 2022

First, I thought that when sending a message to an errored promise, the resulting error was already a new error generated either by Swingset or liveslots/HandledPromise?

Second, I thought @dtribble's point was that the update result of getUpdateSince from the producer side may itself be a promise rejected with the string 'vat upgraded'. I suppose that becomes a question of how the notifier API handles rejected promises as values. Are they wrapped as a success result (like a sync iterator yielding a {done, value} record), or is a promise value somehow unwrapped like in the case of async-from-sync-iterator ?

Edit: sorry I am wholly unfamiliar with our notifiers.

@dtribble
Copy link
Member

From my perspective, this is the promise transitioning to a "disconnected" state. But that's not worth trying to make a distinct state in the promise. Instead, make the rejection distinct: make a unique Disconnection object that represents this disconnection event, or have it be a record with a unique value. Then, a client that makes a retry can tell if that retry fails as a result of the same disconnection event. This is pretty simple, easy to understand, and solves the retry problem. If you can use object identity here, then it doesn't require any unique data. Otherwise, it might require a kernel-provided data value.

@mhofman
Copy link
Member Author

mhofman commented Apr 27, 2022

Ah so mint a brand new object with identity and reject promises of that specific upgrade with that. If the client tries twice and gets that unique object twice as rejection in a row, consider that the notifier is permanently borked?

For identification purposes, we could wrap the unique object in copy data, e.g. {upgraded: uniqueUpgradeToken}

@dtribble
Copy link
Member

First, I thought that when sending a message to an errored promise, the resulting error was already a new error generated either by Swingset or liveslots/HandledPromise?

When you send/then to a rejected promise, the result is a promise rejected with exactly the same rejection value, not a newly generated thing.

Second, I thought @dtribble's point was that the update result of getUpdateSince from the producer side may itself be a promise rejected with the string 'vat upgraded'.

The result of getUpdateSince is the long-lived promise that will get disconnected. So in this scenario, it does get rejected with 'vat upgraded'. The problem is what if the notifier itself is a rejected promise. I've seen code that created a notifier as the result of requesting a notifier from somewhere else (e.g., a regular quote notifier just wraps an underlying timer notifier). Thus it is possible to have the promise for the source notifier become rejected with vat upgraded, or of course an attacker could do it. In that case, every call to gettUpdateSince would get rejected with vat upgrade, causing a simple handler to recursively call getUpdateSince on the "notifier", which get rejected,....

I suppose that becomes a question of how the notifier API handles rejected promises as values.

So this is not about the notifier value at all. The result of getUpdateSince is a record provided by the notifier with { updateCount: n value: v} where the value sent is in the value field. So clients cannot induce this behavior.

@dtribble
Copy link
Member

Ah so mint a brand new object with identity and reject promises of that specific upgrade with that. If the client tries twice and gets that unique object twice as rejection in a row, consider that the notifier is permanently borked?

For identification purposes, we could wrap the unique object in copy data, e.g. {upgraded: uniqueUpgradeToken}

Yep. Unless of course the object identity doesn't propagate usefully across boundaries. I had long ago propose an extension for that, but having it be a presence for the upgrade event would work. Or unique data.

@erights
Copy link
Member

erights commented Apr 27, 2022

I do not like the idea that disconnected is an unforgeable state. Whether it is a string, or an forgeable object that is testably different than a normal error, or even a special error that is testably different than a normal error is still an open question. If so, we should still ask, what does it do under contagion?

@mhofman

This comment was marked as off-topic.

@erights

This comment was marked as off-topic.

@mhofman

This comment was marked as off-topic.

@erights

This comment was marked as off-topic.

@erights

This comment was marked as off-topic.

@mhofman

This comment was marked as off-topic.

@mhofman

This comment was marked as off-topic.

@erights

This comment was marked as off-topic.

@erights

This comment was marked as off-topic.

@mhofman

This comment was marked as off-topic.

@mhofman

This comment was marked as off-topic.

@erights

This comment was marked as off-topic.

@erights

This comment was marked as off-topic.

@mhofman

This comment was marked as off-topic.

@mhofman
Copy link
Member Author

mhofman commented Apr 28, 2022

I have moved the conversation about rejection contagion in eventual-send to endojs/endo#1176

@dtribble
Copy link
Member

What do you think of adding such automatic error wrapping

Requiring that precludes tail recursion optimization because each layer must be prepared to wrap a failure with more information.

@mhofman
Copy link
Member Author

mhofman commented Apr 28, 2022

@dtribble is that comment related to notifiers, or to the broader question of rejection contagion?

@erights
Copy link
Member

erights commented Apr 29, 2022

It certainly is a valid counterargument against wrapping during rejection contagion. Good catch!

@mhofman
Copy link
Member Author

mhofman commented Apr 29, 2022

I'd prefer discussing the rejection contagion in endojs/endo#1176, but I don't see how tail recursion causes any issue. You can always redirect the result promise of the delivery to the result promise of the tail send. We're only talking about wrapping an error in the resolution of the send target, not the whole send.

Equivalent userland code, which can be tail-optimized in the kernel:

const applyMethod = async (x, p, args) =>
  E.when(
    x, 
    (r) => r[p](...args),
    (e) => Promise.reject(new Error(`Cannot call ${p} on rejection`, { cause: e })),
  );

@erights
Copy link
Member

erights commented Apr 30, 2022

I believe @dtribble doesn't mean literal tail recursion, but rather the accumulation of storage proportional to the depth of the recursion. If Alice says

const p = E(bob).foo(carol);

invoking Bob's foo method, which reads

   foo: carol => E(carol).bar();

then currently, the promise for the result of foo is forwarded to the promise for the result of bar. If promise forwarding chains are collapsed by the implementation (admittedly a big ask), then the foo call has contributed no non-garbage to the bookkeeping for the return result. "Tail recursion for the heap".

@dtribble am I representing your concern accurately?

@mhofman
Copy link
Member Author

mhofman commented Apr 30, 2022

@dtribble @erights answered at endojs/endo#1176 (comment)

@Tartuffo Tartuffo added this to the Mainnet 1 milestone May 2, 2022
@warner
Copy link
Member

warner commented Jul 12, 2022

Please coordinate with #5649 , which will change the upgrade-provoked rejection value to be a distinctive object (not an Error).

@Tartuffo Tartuffo removed this from the Mainnet 1 RC0 milestone Sep 21, 2022
@rowgraus rowgraus added vaults_triage DO NOT USE and removed MUST-HAVE labels Dec 20, 2022
gibson042 added a commit that referenced this issue Apr 12, 2023
…grade

subscribeEach iterators continue to fail in that scenario,
because they cannot guarantee absence of gaps.

Fixes #5185
gibson042 added a commit that referenced this issue Apr 13, 2023
…grade

subscribeEach iterators continue to fail in that scenario,
because they cannot guarantee absence of gaps.

Fixes #5185
@gibson042
Copy link
Member

Some interesting questions are being raised by the current state of #7401:

  1. When retrying after getting a vatUpgraded rejection results in a new rejection, how much information can we provide? I'm currently adding a X`Attempting to recover from disconnection: ${initialDisconnection}` note to the new rejection as suggested by @mhofman, which seems to be the best that can be done because AFAICT it is not possible to detect a "splat" that we'd want to replace with the original vatUpgrade rejection (which is actually a similar problem to detecting vatUpgraded in the first place, where rejection is a {name: 'vatUpgraded', upgradeMessage: string, incarnationNumber: number} object rather than an Error instance in order to convey structured failure information across vat boundaries).
  2. What should be done about gap-free subscribers such as subscribeEach iterators? It is not possible to uphold the gap-free guarantee in every case, but would be possible to detect the absence of a gap before a non-failure publication record (and presumably allow the subscriber to remain active) if we commit to publishCount starting at 1n and always incrementing by 1n.

@michaelfig
Copy link
Member

Number 1 looks good. ISTR the reason in assert.note(reason, X...) must be an Error, but I could be wrong. (I have not yet looked to see if your code is tolerant of that constraint, or unit tests somewhere prove that it isn't a constraint.)

As far as 2, I like your suggestion: we can and should "commit to publishCount" semantics that you described there.

gibson042 added a commit that referenced this issue Apr 16, 2023
…grade

subscribeEach iterators continue to fail in that scenario,
because they cannot guarantee absence of gaps.

Fixes #5185
gibson042 added a commit that referenced this issue Apr 25, 2023
…grade

subscribeEach iterators continue to fail in that scenario,
because they cannot guarantee absence of gaps.

Fixes #5185
@mergify mergify bot closed this as completed in #7401 Apr 25, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request notifier SwingSet package: SwingSet vaults_triage DO NOT USE
Projects
None yet
9 participants