Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Add consensus-independent vat snapshot/transcript archiving configuration #10055

Merged
merged 9 commits into from
Sep 11, 2024

Conversation

gibson042
Copy link
Member

@gibson042 gibson042 commented Sep 10, 2024

Fixes #10036

Description

Adds consensus-independent vat-{snapshot,transcript}-archive-dir cosmos-sdk swingset configuration for propagation in AG_COSMOS_INIT. When set, gzip-compressed vat snapshot/transcript files are written to the respective directory (the former when created, the latter when finalized by rollover).

It also includes miscellaneous documentation and code improvements, most significantly a new helper for replacing const cleanups = []; return aggregateTryFinally(async () => { … cleanups.push(…); … }, async () => { await PromiseAllOrErrors(cleanups.reverse().map(fn => Promise.resolve().then(() => fn()))); }); with return withDeferredCleanup(async addCleanup => { … addCleanup(…); … }); (i.e., folding in the cleanup registration and ordering).

Best reviewed commit-by-commit.

Security Considerations

This provides a new interface by which the Swingset kernel can write to its host filesystem, albeit indirectly via callback. But as such, it is a potential vector for filling up a disk or possibly overwriting files, although subject to an operator's own bad configuration and/or elevated permissions.

Scaling Considerations

When one or both new settings are active and the disk is full or slow, performance may suffer. I considered performing these writes in the background, but decided that silent failures or slow writes would be worse than the new awaits (and when active, the archive dir will likely be on the same filesystem as the database anyway).

Documentation Considerations

I think I've covered what I need to, although there might be some other Markdown files worth updating.

Testing Considerations

New features are covered by unit tests.

Upgrade Considerations

This is all kernel code that can be used at any node restart (i.e., because the configuration is consensus-independent, it doesn't even need to wait for a chain software upgrade). But we should mention the new cosmos-sdk configuration in release notes, because it won't be added to existing app.toml files already in use.

Copy link
Member

@mhofman mhofman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a matter of process, could we split and merge the first part of this PR that addresses #10054? It seems like it can stand on its own and is sufficiently different than the archive change. I have reviewed that set of commit and would approve such a PR.

@@ -257,18 +257,27 @@ export function makeTranscriptStore(
WHERE vatID = ? AND position >= ? AND position < ?
`);

function doSpanRollover(vatID, isNewIncarnation) {
function closeSpan(vatID, bounds) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add a comment that closeSpan does not note any changes to the .current export and that the caller is in charge of that. An alternative would be to always note a deletion of that export when closing, but that's unnecessary churn for the common case.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link

cloudflare-workers-and-pages bot commented Sep 10, 2024

Deploying agoric-sdk with  Cloudflare Pages  Cloudflare Pages

Latest commit: bf76e00
Status: ✅  Deploy successful!
Preview URL: https://1af799c3.agoric-sdk.pages.dev
Branch Preview URL: https://gibson-10036-archive-transcr.agoric-sdk.pages.dev

View logs

Copy link
Member

@mhofman mhofman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this looks good, but please consider some of the suggested refactorings.

* @template T
* @param {() => Promise<T>} trier
* @param {(error?: unknown) => Promise<unknown>} finalizer
* @returns {ReturnType<trier>}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why use ReturnType here instead of Promise<T> ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No strong reason, just a way to emphasize that the happy-path result is just decoration of the function.

*
* @template T
* @param {(
* addCleanup: (fn: (err?: unknown) => Promise<void>) => void,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think addCleanup should be allowed to accept a function that returns something else than a promise, but the problem is that Promise<void> | void has undesirable behavior. Maybe Promise<any> | any, just for documentation purposes?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it'd be easier to just wrap with async () => { … }, but we can cross that bridge if/when we get there. This is an internal async helper, and there's no current need for supporting non-async callbacks.

packages/swing-store/test/deletion.test.js Show resolved Hide resolved
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider extracting the asyncification in a standalone refactor commit.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It actually started that way, but it turned out that only two of the changes were non-fundamental.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean by "non-fundamental". I consider making a function and all its callers transitively async to be a pure refactor.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean that you want to see a commit which is just marking functions as async and inserting await at their call sites, with no point-in-time motivation for such a refactor until the following commit adds internal awaits, as opposed to a42c45f including both? I don't think I would find that useful.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would be my soft preference, yes. It's similar to moving some code around in a first commit in preparation for a subsequent commit which has a more minimal behavioral change.

Comment on lines -3 to +11
import { resolve as pathResolve } from 'path';
import path from 'node:path';
import v8 from 'node:v8';
import process from 'node:process';
import fs from 'node:fs';
import fsPromises from 'node:fs/promises';
import { performance } from 'perf_hooks';
import { resolve as importMetaResolve } from 'import-meta-resolve';
import tmpfs from 'tmp';
import { performance } from 'node:perf_hooks';
import { fork } from 'node:child_process';
import { resolve as importMetaResolve } from 'import-meta-resolve';
import tmp from 'tmp';
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider extracting the rename of path and tmp in a standalone refactor commit.

const store = initSwingStore(dbDir, {
exportCallback,
keepTranscripts,
archiveTranscript,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that the swingStore does extra / different work when archiveTranscript is present, should this be a variation of the test?

Same for archiveSnapshot, especially if we end up teeing the compression stream.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The work is purely additive, so I figured it was fine to always check in these tests rather than multiply its configuration to the cross product of keepTranscripts={true,false} and archiveTranscript={undefined,fn}—plus there's lots of testing that covers initSwingStore with archiveTranscript=undefined, including in this file. Do you feel strongly about making this a test variation?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not strongly no, but since we already have a macro, I thought it could be cheap enough. Not all permutations need to be tested (e.g. transcript options are fully independent of the snapshot option)

name: tmpName,
fd,
removeCallback,
} = tmp.fileSync({
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why use the sync version if we're in an async function ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No strong reason; this step doesn't need to be async and honestly wouldn't benefit from it anyway.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm just wondering because I believe it means the removeCallback is sync, which may be unexpected (according to types) for the cleanup. I do remember that tmp had some surprising changes in behavior between sync and async, but I'm not sure what those where now.

packages/swing-store/src/archiver.js Show resolved Hide resolved
packages/swing-store/src/archiver.js Show resolved Hide resolved
packages/swing-store/src/snapStore.js Outdated Show resolved Hide resolved
@gibson042
Copy link
Member Author

As a matter of process, could we split and merge the first part of this PR that addresses #10054? It seems like it can stand on its own and is sufficiently different than the archive change. I have reviewed that set of commit and would approve such a PR.

#10060

@gibson042 gibson042 force-pushed the gibson-10036-archive-transcripts branch from c5f7c6c to 4dcdb4f Compare September 11, 2024 05:13
Comment on lines 291 to 296
deferred.push(kernelKeeper.removeVatFromSwingStoreExports(vatID));
for (const kpid of deadPromises) {
resolveToError(kpid, makeError('vat terminated'), vatID);
}
}
await Promise.all(deferred);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is necessary to preserve the log ordering in terminate.test.js. Alternatively (or additionally), we could update bootstrap-terminate-with-presence.js to tolerate ordering variation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am confused about why the simple await was not working.

Copy link
Member Author

@gibson042 gibson042 Sep 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Figuring it out was an interesting dive... vat-vat-admin.js terminateWithFailure ends with a synchronous call to D(vatAdminDev).terminateWithFailure, which through further synchronous calls in device-vat-admin.jskernel/deviceManager.jskernel.jskernelSyscall.jsvat-admin-hooks.js
invokes void terminateVat(vatID, true, marshalledReason), which means that only the synchronous prelude of terminateVat has an effect before the result promise of E(vatAdminNode).terminateWithFailure(reason) settles and enqueues notifies.

Before this PR, kernelKeeper.removeVatFromSwingStoreExports was synchronous and therefore could not impede the following step's rejection of promises yet to be decided by the terminating vat (which therefore preceded fulfillment of the terminateWithFailure result promise).

But now that this PR is making removeVatFromSwingStoreExports async, a naïve await thereof would end the synchronous prelude before dead promise rejection, which is what caused the assertion in terminate.test.js to fail. Deferring the await restores the expected ordering, although we may want to move it even further down to avoid bumping following logic out of the synchronous prelude, e.g.:

    if (critical) {
      // The following error construction is a bit awkward, but (1) it saves us
      // from doing unmarshaling while in the kernel, while (2) it protects
      // against the info not actually encoding an error, but (3) it still
      // provides some diagnostic information back to the host, and (4) this
      // should happen rarely enough that if you have to do a little bit of
      // extra interpretive work on the receiving end to diagnose the problem,
      // it's going to be a small cost compared to the trouble you're probably
      // already in anyway if this happens.
      panic(`critical vat ${vatID} failed`, Error(info.body));
    } else if (vatAdminRootKref) {
      // static vat termination can happen before vat admin vat exists
      notifyTermination(
        vatID,
        vatAdminRootKref,
        shouldReject,
        info,
        queueToKref,
      );
    } else {
      console.log(
        `warning: vat ${vatID} terminated without a vatAdmin to report to`,
      );
    }

    // worker, if present, needs to be stopped (note that this only applies to ephemeral vats)
    deferred.push(() => vatWarehouse.stopWorker(vatID));

    await Promise.all(deferred);

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually would want to go a step further and remove the async nature of this function, but still make it return a promise. However that would release zalgo somewhat, so we should not do that in a hurry now.

Please go ahead with the proposed change above, but use PromiseAllOrErrors to avoid swallowing multiple errors.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Filed #10073 to track addressing this footgun

mergify bot added a commit that referenced this pull request Sep 11, 2024
…n rollover (#10060)

Fixes #10054

## Description
The first commits from #10055, [by request](#10055 (review)), culminating in a fix of #10054 by introducing a `closeSpan` helper to support both span rollover and `stopUsingTranscript`.

### Security Considerations
None known.

### Scaling Considerations
If anything, a negligible reduction in transcriptStore size.

### Documentation Considerations
None known.

### Testing Considerations
Covered by unit tests.

### Upgrade Considerations
This is kernel code that can be used at any node restart (i.e., because the configuration is consensus-independent, it doesn't even need to wait for a chain software upgrade).
@gibson042 gibson042 force-pushed the gibson-10036-archive-transcripts branch from 4dcdb4f to e322e2b Compare September 11, 2024 05:57
@gibson042 gibson042 force-pushed the gibson-10036-archive-transcripts branch from 4ff2fd7 to bf76e00 Compare September 11, 2024 18:25
@gibson042 gibson042 added the automerge:rebase Automatically rebase updates, then merge label Sep 11, 2024
@mergify mergify bot merged commit 07f6885 into master Sep 11, 2024
90 checks passed
@mergify mergify bot deleted the gibson-10036-archive-transcripts branch September 11, 2024 19:01
gibson042 added a commit that referenced this pull request Sep 11, 2024
mergify bot added a commit that referenced this pull request Sep 11, 2024
Ref #10055

## Description
The internal `makeAggregateError` introduced by #10055 properly logged fake-AggregateError errors, but failed to expose them on the error object. But this is [already supported in endo](https://github.com/endojs/endo/blob/4406f5dde521d539a5effeb8ab68c1316e45261d/packages/ses/src/error/assert.js#L321), so we should use it.

### Security Considerations
n/a

### Scaling Considerations
n/a

### Documentation Considerations
n/a

### Testing Considerations
n/a

### Upgrade Considerations
n/a
@@ -289,13 +392,13 @@ export function makeTranscriptStore(
}
}

function doSpanRollover(vatID, isNewIncarnation) {
async function doSpanRollover(vatID, isNewIncarnation) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, this makes me sad. Swingstore wants to be synchronous for all the "normal" day-to-day operations. I'm ok with async for commit and exports and things done by the host, but I'd really prefer that everything the kernel touches is synchronous.

I see that it happened to fit into the existing callers (they're already awaiting a snapshot write, or a vat eviction), so the contagion didn't have a big impact, but in general, any async operational API call raises concerns about what state the swingstore is in while that Promise is pending.

In this case, there's a period where the old span is closed, but the new span has not yet been created (we're sitting in the await closeSpan(), which is sitting in its await archiveTranscript). If someone calls into swingstore while that Promise is pending, we'll have no isCurrent spans, and doing something like addItem will throw (at best) or be very confused (at worst). With a synchronous interface, this invariant-violating state is not exposed.

As you know, I'm not a fan of async file IO, I'm not convinced that there's enough perf improvement (if any) to justify the complexity. But I do see that you didn't have a lot of choice.. the gzip streams used by makeArchiveTranscript would be a PITA to use synchronously (but that just makes me sad about the gzip APIs we have available).

So, ok, but let's make it a goal to keep the API synchronous, and only allow exceptions like this when really necessary.

(sorry I missed this when the review was pending, I was OOO).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No worries; after the fact reviews are still much appreciated! Should we open a new ticket about refactoring to isolate asynchronicity?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
automerge:rebase Automatically rebase updates, then merge
Projects
None yet
Development

Successfully merging this pull request may close these issues.

swingstore mode to move/archive historical transcript spans to separate files, outside the DB
3 participants