Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: triage await safety #6219

Closed
wants to merge 2 commits into from
Closed

fix: triage await safety #6219

wants to merge 2 commits into from

Conversation

erights
Copy link
Member

@erights erights commented Sep 16, 2022

Jessie warns of violations of https://github.com/Agoric/agoric-sdk/wiki/No-Nested-Await . Before this PR, a yarn lint of agoric-sdk generated 188 of these warnings. This PR triages those warnings. 10 remain warnings and should be fixed subsequent to this PR, preferably by someone who better understands the code these occur in.

For each warning, I either

  • for the 10 that still need to be fixed, I added a TODO FIXME comment and filed one of the issues listed below.
  • turned the rule off wholesale for modules in packages that likely don't need to be assert-safe.
  • fixed the code so that it obeys the No-Nested-Await rule
  • fixed the code so it is actually safe for reasons I explain, even though it still violates No-Nested-Await.
  • realized the code was already safe, sometimes for subtle reasons I explain, even though it violates the No-Nested-Await rule.

For each explanation I inserted, I tried to state the principle I found myself using to figure out that the code is safe. For each of these, I coined a term for this principle, such as "terminal-control-flow", and explain it well the first time I encounter it. Other times, I use the name of the principle, explaining only why it applies to that individual case.

Of my fixes, if they were about correctness, I have filed one of "Fixes" issues listed below that also raises the possibility of cherry picking that fix, so that it could go in ahead of the rest of this PR.

Reviewers, after this PR goes through its first sanity check by you, based on your feedback I will then collect an explanation of the surviving principles and put them someplace more reusable, like our wiki.

Everyone, please look at my changes to the code you're most familiar with to see if I broke anything. I tried to be very careful not to, but... . Also, please see if my explanations of why your code was safe, or with my mods is now safe, make sense to you. Even better if you could send me less convoluted replacements than mine for your current code, for example code in which we no longer need to suppress the Jessie await warning.


Does not yet fix #6220 , but I expect to include it before merging if it has not already been fixed.

Still needing analysis, but could be addressed in separate PRs:
#6226
#6227
#6228
#6229
#6230
#6231
#6232
#6233
#6234
#6235

This PR would fix, if not already fixed by cherry picking from this PR:
Fixes #6237
Fixes #6239
Fixes #6240
Fixes #6242
Fixes #6243
Fixes #6244

Copy link
Member

@dckc dckc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked at a handful of files as noted.

Comment on lines +173 to +188
// `ensureVatOnline` is an async function, so this turn boundary always
// happens at the end of each iteration. However, the control-flow is
// still unbalanced. If the loop iterates zero times, we proceed to
// the next code synchronously. Else we proceed asynchronously.
// However, the next loop is terminal, and each iteration of that loop
// resemble each iteration of this one. Considered together, in all
// cases the first iteration of one of these loops will happen first,
// if there are any. And if there are more iterations from either loop,
// a turn boundary separated each iteration from the next.
// eslint-disable-next-line no-await-in-loop, @jessie.js/no-nested-await
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I concur with the justifications in this file, but yes let's do refer to the full explanations by reference rather than having them inline.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

packages/swing-store/src/snapStore.js Outdated Show resolved Hide resolved
packages/vats/src/vat-bank.js Show resolved Hide resolved
// flow branch, and is top level within that branch.
// eslint-disable-next-line @jessie.js/no-nested-await
const array = await handleCommand(message.subarray(1));
// This nested await is ALMOST safe because "terminal-control-flow".
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

paging @kriskowal . PTAL.

Copy link
Member

@dtribble dtribble left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comments so far

@@ -85,6 +85,8 @@ export default async function cosmosMain(progname, rawArgs, powers, opts) {
}

if (popts.pull) {
// This await is safe because "terminal-control-flow"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I disagree here. Since the line below the if is a recursive call, if opts.dockerTag && ! popts.pull then this will infinite loop within a turn.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's not something an attacker can supply, so this is low risk.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

interesting!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See #6235

@@ -197,9 +197,22 @@ const start = async zcf => {
trace('exiting processTranches loop');
return;
}
// This await almost certainly does not do what was intended.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dang it! That seems correct. How would it get integrated so that a long-lived generator is pulled on and awaited once per block

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See #6220

await oncePerBlock();

// Determine the max allowed tranche size
//
// This nested await is safe because "terminal-control-flow".
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are all part of the related design pattern of using for for await and an async generator. We should have one comment to that effect and then suppress the overall nested await check in this file

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we turn the current Jessie warning off for the whole file, then we won't notice if an await slips in that isn't safe for this reason. Until we have more static check distinctions, I think we should still suppress them individually, citing the safety principle for each one. But obviously, we can pull out all further commentary.

// It occurs at top level of the body of a top level terminal
// while. But do note that the enclosing function is an asynchronous
// generator function and that there is a `yield` within the loop.
// I don't think that causes any additional worries, but I have not
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe it does not. This must be used within a for await, which awaits at the top of the loop (and therefore at each yield). So a yield i effectively an await.

@@ -466,7 +466,16 @@ const helperBehavior = {

// The rest of this method will not happen until after a quote is received.
// This may not happen until much later, when the market changes.
// eslint-disable-next-line no-await-in-loop
//
// This nested await is safe because "terminal-control-flow"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should refer to the for/await/generator/while pattern that is used in several places for vaults.

//
// Some code does execute after this one, synchronously or asynchronously.
// That would normally be unsafe, but all that code is clearly not stateful.
// Its correctness is independent on timing.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Its correctness is independent on timing.
// Its correctness is independent of timing.

await oncePerBlock();

// Determine the max allowed tranche size
//
// This nested await is safe because "terminal-control-flow".
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we turn the current Jessie warning off for the whole file, then we won't notice if an await slips in that isn't safe for this reason. Until we have more static check distinctions, I think we should still suppress them individually, citing the safety principle for each one. But obviously, we can pull out all further commentary.

Copy link
Member

@dtribble dtribble left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is one element that I think should NOT be checked in, and a few alternate suggestions. I reviewed all changes, except for the "IIUC" examples in e.g., solo directories.

// this statement and proceeding syncronously to the next
// iteration. The await at the top of the loop ensures that the
// remainder of the loop only happens after a turn boundary.
// The only remaining unsafety is that the awaited expression of
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All the other branches are terminal (return or throw) and so this unsafety does not actaully occur here.

// loop body safer. A `for await of` will correctly process a synchronous
// iterator, but it will unconditionally separate the iterations with
// with turn boundaries.
for await (const step of steps) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This potentially adds a a lot of awaits. (2 additional awaits per step). While they are cheap individually, steps could I think be quite large if e.g., you are trying to catch up on a restart.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Loading vats on replay is a serialized process, and so I think the risk of the current code is low. I recommend NOT including this change without a lot more testing than we can perform at this time. And so it shoudl just wait until it gets refactored to not need this kind of layering.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See #6242

// IIAFE, we ensure that the `unlink` can only be reached asynchronously.
//
// eslint-disable-next-line @jessie.js/no-nested-await
await (async () => rename(tmpFilePath, target))();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Simpler might be:

await null;
await rename(tmpFilePath, target);

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The semantics are more different. Might be ok in this case, but...

The original code always calls rename in the same turn that was already executing. For all I know that atomicity may be important. My change preserves that. Your's puts an interleaving point before the rename.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OTOH, there is an earlier await almost immediately above it. In this case, your extra await would only break up the atomicity with the resolve which seems unlikely to cause a problem, but would still need looking into.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See #6237

// `if` is not terminal because of the stateful `close()`s afterward.
// On the else branch, those `close()`s would happen synchronously.
// eslint-disable-next-line @jessie.js/no-nested-await
await (async () => destination.sync())();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as above. This seems too mysterious. I'd live with it if it were more efficient though :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above, in this case it happens to be safe to introduce a new interleaving point before the sync() call, because the immediately preceding stateful code is another await. But it is not a technique that works in general, so we should be careful about making it a habit.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See #6239

Copy link
Member

@turadg turadg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The await safety triaging seemed to have enough review so I reviewed how it's communicated.

I will then collect an explanation of the surviving principles and put them someplace more reusable, like our wiki.

+1 to having an index of the principles to reference. Ideally this would be something you can pop into within the IDE. I think something like this would accomplish that: /** @see {SafeAwait['terminal-control-flow']} */. Short of that, I think the URL should be stable and go directly to the explanation, which both rule out the Github wiki. E.g. shorten.er/await-safety#terminal-control-flow

For the "IIUC" suppressions, I think you did so I pushed package level disabling (agoric-cli, solo).

I also noticed the change to allValues was just one of the few instances so I pushed a change to consolidate them.

@@ -25,8 +25,10 @@ const { keys, values, fromEntries } = Object;
* @returns {Promise<Record<string, V>>}
* @template V
*/
const allValues = async obj =>
fromEntries(zip(keys(obj), await Promise.all(values(obj))));
const allValues = async obj => {
Copy link
Member

@turadg turadg Sep 16, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we have several of these. I pushed 667ad5c to DRY them

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Much better, thanks!

Copy link
Member

@turadg turadg Oct 11, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like your force pushes since then have erased 667ad5c

I've pushed it again. @erights mind pulling f902c5d your local branch?

@@ -1,3 +1,8 @@
// IIUC the purpose of the agoric-cli package, we do not need to worry
// about await safety for any of the code in this package.
// TODO someone who understands this package better should verify this.
Copy link
Member

@turadg turadg Sep 16, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

confirmed. let's disable for the whole package. 397c702

@@ -1,3 +1,8 @@
// IIUC the purpose of the solo package, we do not need to worry
Copy link
Member

@turadg turadg Sep 16, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

disabled for package. 0163fee

.eslintrc.cjs Outdated
// are code-reviewed.
// The rule is “no nested awaits” but the architectural goal is “no
// possibility of ‘awaits sometimes but not always’”. That is our
// architectural rule for vats which require distributed determinism. If
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is distributed determinism relevant? I thought the issue was correctness.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for reviewing this part I forgot to mention. I was trying to distinguish why CLI and solo didn't need this linking.

My understanding is it's only relevant when there things can happen in different turns, ergo blockchain context.

What's a better description of when the lint rule needs to be enforced?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are the purposes of the cli and solo packages? What are the worst case consequences of an interleaving bug in the code of these packages?

Depending on the answers, we may want to leave these warnings on for these packages and carry through the triage for the problematic awaits is these packages.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Different turns happen any time code doesn't complete sequentially; every await or promise.then(...) or I/O callback introduces another turn. No blockchain context needed.

My understanding is that the rule is relevant any time we want correct code; that is: to avoid a pattern that is known to cause bugs. I could be wrong. @erights ? @dtribble ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are the purposes of the cli and solo packages? What are the worst case consequences of an interleaving bug in the code of these packages?

CLI is ergonomics for an end-user (or dev) to craft and send commands. Solo is is the solo Vat runner, which tmk is for testing. I can't think of any dire consequences to bugs in either.

Different turns happen any time code doesn't complete sequentially

Where can I read more about this? I've never encountered the term before Agoric and can't find it in web search.

If this PR needs to land for PS0, @erights feel free to remove any of my changes. I can follow up with them later.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Though they don't seem to use the term "turn", the concept is well known in the JS world, yes? A "turn" is just one turn through the event loop.

For example, https://developer.mozilla.org/en-US/docs/Web/JavaScript/EventLoop:

At some point during the event loop, the runtime starts handling the messages on the queue, starting with the oldest one. To do so, the message is removed from the queue and its corresponding function is called with the message as an input parameter. As always, calling a function creates a new stack frame for that function's use.

The processing of functions continues until the stack is once again empty. Then, the event loop will process the next message in the queue (if there is one).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

p.s. I do agree that the whole "Robust Composition" thesis is worth reading in due course. Though it's over 200 pages, it's not dense material. Much of it is stuff you almost certainly already know.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Solo is is the solo Vat runner, which tmk is for testing.

Solo powers ag-solo; it's it's intended, in due course, for dapp back-ends that may hold significant assets. @dtribble recently gave an example of a gateway to an online game world, I think.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds like we should just drop my two eslint/comment changing commits. I figure that's easiest to do by interactive rebase and force push. To avoid clobbering any concurrent work I'll refrain unless you request it @erights

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll do it when I get back to working on this PR. Thanks.

Comment on lines +220 to +225
// In that case, the function exits would executing and further
// code.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can’t parse this sentence. Perhaps s/would/without/, s/any/any/?

Comment on lines +229 to +234
// that initial await may therefore be reached synchronously or
// asynchronously.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The expression awaited at the top of the cthuloop, messagesFromXsnap.next(undefined), just manipulates an entangled pair of async promise queues, putting undefined on one and getting the next promise from the other. I’m not sure whether this affects your analysis.

//
// Some code does execute after this one, synchronously or asynchronously.
// That would normally be unsafe, but all that code is clearly not stateful.
// Its correctness is independent on timing.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Its correctness is independent on timing.
// Its correctness is independent of timing.

Comment on lines 377 to 379
atomicWriteInRoot(baseName, async tmpGzPath =>
filter(tmpSnapPath, createGzip(), tmpGzPath, { flush: true }),
),
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See #6240

// loop body safer. A `for await of` will correctly process a synchronous
// iterator, but it will unconditionally separate the iterations with
// with turn boundaries.
for await (const step of steps) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See #6242

@erights erights force-pushed the markm-triage-await-safety branch from a1a5790 to 08b83f6 Compare October 1, 2022 03:42
@erights erights force-pushed the markm-triage-await-safety branch 2 times, most recently from 3dd42e2 to 43c98b8 Compare October 9, 2022 04:27
@erights erights force-pushed the markm-triage-await-safety branch from 43c98b8 to 2434353 Compare October 11, 2022 04:16
@erights erights force-pushed the markm-triage-await-safety branch 2 times, most recently from 95f67b4 to cd75501 Compare October 20, 2022 06:31
@erights
Copy link
Member Author

erights commented Oct 20, 2022

Reviewers, this PR is in good shape. The embargo has been lifted. AFAIK, nothing significant changed since the reviews above. But, this is the PR that I'm most nervous about needing to keep up to date to follow changes on master. I'd much rather merge it before there are substantial changes on master to reduce the coordination work. Thanks!

Copy link
Member

@dckc dckc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't reviewed the whole thing... when I saw separate issues filed for the various parts, I assumed this PR would get split up similarly. But I gather you're interested to merge this soonish.

Please let's do refer to the full explanations by reference rather than having them inline (e.g. vat-warehouse.js)

And we should have a test for each fix. Perhaps that's not cost-effective, but let's make that decision explicitly; no fair skipping the "testing considerations" part of the PR template :)

I'm not sure which reviewer is in a position to approve this whole thing. I am not. Perhaps a PR per package? hm... a dozen PRs is a lot of tedium. Maybe separate commits? Also tedious, and doesn't clarify who approved what very well.

@turadg
Copy link
Member

turadg commented Oct 21, 2022

Please let's do refer to the full explanations by reference rather than having them inline (e.g. vat-warehouse.js)

Start of that https://gist.github.com/turadg/e7d7645789218173a578508ae7640365 (now at endojs/Jessie#93)

assumed this PR would get split up similarly

For splitting up:
@warner #6526

  • swingset
  • cosmic-swingset

@turadg #6528

  • agoric-cli
  • inter-protocol
  • internal
  • web-components

@michaelfig #6529

  • cache
  • deploy-script-support
  • solo

@mhofman #6530

  • telemetry

@dckc #6531

  • vats
  • xsnap

@Chris-Hibbert #6533

  • zoe

@erights erights force-pushed the markm-triage-await-safety branch from 5e32b38 to e342a67 Compare November 19, 2022 05:19
@erights
Copy link
Member Author

erights commented Dec 7, 2022

Closing in favor of the unbundling at #6219 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment