
define worker-v1 package #6596

Open
warner opened this issue Nov 22, 2022 · 19 comments
Assignees: warner
Labels: enhancement (New feature or request), liveslots (requires vat-upgrade to deploy changes), SwingSet (package: SwingSet), vaults_triage (DO NOT USE), xsnap (the XS execution tool)

Comments

@warner
Member

warner commented Nov 22, 2022

What is the Problem Being Solved?

To give us a coherent solution for #6361 , I think we need to define a new package that includes all the behavior-influencing parts of our XS-based workers. This package will have -v1 in the name, not the version, and in fact we will only ever publish a single version of this package: all subsequent changes will go into a new package, with -v2 in the name, which will also only get one release, etc.

Then we change SwingSet to depend upon this -v1 worker package, and add a note to the kernelDB per-vat metadata to record the fact that the incarnation is using v1, rather than some earlier or later version. Any version of SwingSet can host this worker and its history (especially its XS heap snapshots) without fear of behavior changes.
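A minimal sketch of what that per-vat record might look like (field names here are hypothetical, not the actual kernelDB schema):

```js
// Hypothetical per-vat metadata entry recording which worker package
// (and therefore which XS/lockdown/supervisor behavior) this incarnation
// was started with. Field names are illustrative only.
const vatRecord = {
  vatID: 'v12',
  incarnation: 0,
  worker: '@agoric/worker-v1', // never changes within an incarnation
};
```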

The nominal name of the new package is @agoric/worker-v1, but we might want something more swingset-specific.

Description of the Design

@agoric/worker-v1 will need to contain or import the contents of:

  • packages/SwingSet/src/supervisors/subprocess-xsnap/
  • packages/SwingSet/src/liveslots (also used by other workers, and a lot of kernel unit tests)
  • packages/xsnap
  • a few shared files from packages/SwingSet/src/lib/

The kernel-facing API of this package should let it launch a worker and get back control facets to send deliveries down and process syscalls that come back.
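A rough sketch of what that API could look like (the `startWorker`, `deliver`, and `handleSyscall` names are assumptions for illustration, not a settled interface):

```js
// Hypothetical kernel-facing API: the kernel supplies a syscall handler,
// and gets back a facet for sending deliveries down to the worker.
import { startWorker } from '@agoric/worker-v1'; // hypothetical export

async function runOneDelivery(kernelSyscallHandler, snapshotStream) {
  const worker = await startWorker({
    snapshot: snapshotStream, // resume from an XS heap snapshot, if any
    handleSyscall: vso => kernelSyscallHandler(vso), // syscalls coming back up
  });
  // send one delivery down, collect the results (status, metering, etc.)
  const result = await worker.deliver(['message', 'o+0', ['hello', []]]);
  await worker.shutdown();
  return result;
}
```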

Security Considerations

Test Plan

@warner added the enhancement, SwingSet, and xsnap labels Nov 22, 2022
@warner self-assigned this Nov 22, 2022
@warner
Member Author

warner commented Dec 23, 2022

Perhaps we should make the bundle files for this package during the build process (and publish them to NPM), so their contents are well-specified. Every yarn build is expected to create exactly the same contents. Endo's bundle-source includes version numbers in the generated bundle (as part of the component map), so the following pieces must be stable:

  • every line of code and every version string in the dependency graph that gets copied into the bundle
  • the behavior of everything else that might influence the creation of the bundle (bundle-source behavior)

Having a yarn.lock should enforce that. There will be a very limited set of changes that could possibly be made (between the original worker-v1@1 and a subsequent worker-v1@2 release) that can still maintain this property, and none of them can affect the behavior of the generated bundle. E.g. we might find a new Endo version of bundle-source that bundles twice as fast as before: we can use that, as long as the bundles it creates are identical to the version used by @1. But a new bundle-source that creates bundles half the size as @1 could not be used.
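A build step along these lines could generate the bundle and record its hash, so a later rebuild can be compared against the published artifact (a sketch, assuming `@endo/bundle-source`; the entry point and output paths are illustrative):

```js
// build-bundle.js: generate the worker bundle and record its SHA-256,
// so reproducibility can be checked byte-for-byte later.
import fs from 'fs';
import { createHash } from 'crypto';
import { fileURLToPath } from 'url';
import bundleSource from '@endo/bundle-source';

const entry = fileURLToPath(new URL('./src/entry.js', import.meta.url)); // illustrative
const bundle = await bundleSource(entry);
const contents = JSON.stringify(bundle);
const sha256 = createHash('sha256').update(contents).digest('hex');
fs.mkdirSync('dist', { recursive: true });
fs.writeFileSync('dist/bundle.json', contents);
fs.writeFileSync('dist/bundle-hash.txt', `${sha256}\n`);
console.log(`bundle sha256: ${sha256}`);
```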

@kriskowal
Member

We cannot rely on yarn.lock to pin exact versions of the transitive dependencies of an *-v* package in a workspace where more than one version shares a yarn.lock, since yarn is free to solve the dependency graph in a way that satisfies the most recent version. We either need these versioned worker packages to contain a checked-in bundle and exact xsnap binary, or we need each of these versions to have an independent yarn.lock. We should jam on this.

@kriskowal
Member

(That is, if yarn.lock is free to change after worker-v1 gets captured, worker-v1 behavior will change.)

@warner added the liveslots label Jan 24, 2023
warner added a commit that referenced this issue Jan 27, 2023
This creates a new `@agoric/swingset-liveslots` package, which owns
the old `packages/SwingSet/src/liveslots/` code. Several tests which
specifically exercised liveslots (but not the rest of the kernel) were
moved over, along with some helper files. The "fake virtual stuff"
tools in SwingSet/tools/ were moved over too.

No file contents were changed, so the new files won't work yet (and
this breaks SwingSet itself), until we modify them in the next
commit. This "move, then modify" approach should help with git
rebasing and history management in the future, as well as code review.

New files were created for the metadata: package.json,
jsconfig.json. The agoric-cli `sdk-package-names.js` file was
regenerated to include the new package.

refs #6596
@warner
Member Author

warner commented Jan 30, 2023

(hm, ok that may or may not torpedo the following plan)

I've spent more time thinking about this. The constraints are:

  • the swingset-kernel package (currently in agoric-sdk:packages/SwingSet, with NPM name @agoric/swingset-vat) must be able to simultaneously import multiple versions of an "xs-worker" package, each of which may have different subsets of XS/xsnap.c/lockdown/supervisor/liveslots behaviors
    • e.g. one package with NPM name @agoric/swingset-worker-xs-v1, and a second package with NPM name @agoric/swingset-worker-xs-v2
  • later versions of the kernel may choose to change their swingset-worker-xs-v1 dependency from version 1.0.0 to 1.1.0 to 1.2.0, for various reasons, with full knowledge of the compatibility (vs. bug-fixing) consequences thereof
  • a given version of swingset-worker-xs-v1 must consistently use the exact same XS library, xsnap.c application, lockdown bundle, and supervisor (+liveslots) bundle
  • two related versions of swingset-worker-xs-v1 should be able to change one of these things without changing the other
    • e.g. swingset-worker-xs-v1 @ 1.2.0 might use the same bundles as @1.1.0, but use an XS library with some bugfixes or performance improvements
    • we might encode the compatibility consequences into the 1.2.0 string, like maybe 1.1.0 -> 1.1.1 might change performance but not metering/syscall/GC behavior, whereas 1.1.0 -> 1.2.0 might allow metering/syscall/GC behavior changes but retain heap-snapshot compatibility. maybe.

We have two bundles used to start an xsnap worker. The first is named "the lockdown bundle", and currently the entry point is SwingSet/src/supervisors/subprocess-xsnap/lockdown-subprocess-xsnap.js , which is a one-line library that imports @agoric/xsnap/lib/ses-boot.js, which is a 5-line library that does:

```js
import { setObjectInspector } from './console-shim.js';
import '@endo/init';
import confinedObjectInspect from './confined-object-inspect.js';
setObjectInspector(confinedObjectInspect);
harden(console);
```

The second is named "the supervisor bundle", and the entry point is supervisor-subprocess-xsnap.js, which imports liveslots, creates an instance of liveslots, and prepares the worker to receive netstring commands.

Our bundles include the package name, package version, and relative pathnames of every file in the bundle, including the entry point. To generate identical bundles, it is not sufficient for the entrypoint file to have the same contents: it must also live in the same package, of the same package version, and at the same relative path.

That means swingset-worker-xs-v1 cannot hold the entrypoint file, since we want two separate versions of this package to be capable of using the exact same bundles (controlled differences). As @kriskowal pointed out, this really means the bundles should be generated as part of a build process, and published to NPM (with of course some simple process to regenerate them locally to compare against the published versions).

The Plan

So that makes me think the following plan could work:

[diagram: worker-v1 #6596, Frame 1]

The light blue boxes represent packages, at various versions. The blue connecting lines are cross-package dependencies. The white boxes are JS files or JS-level exports of the package, and the black connecting lines are import statements.

The @agoric/swingset-kernel package (originally swingset-vat) is the anchor. The first version we publish only supports "worker-v1", but a later version could support both -v1 and -v2.

The "worker-v1" support is provided by a package named @agoric/swingset-worker-xs-v1. This depends upon three other packages.

The first is xsnap. This lives in a package named @agoric/xsnap at some specific version, which provides both the xsnap binary and the JS libraries which talk to it.

  • This package uses git submodule to reference two other repositories: one for the XS library (currently our fork in agoric-labs/moddable), and a second for the xsnap.c code (also a fork, in agoric-labs/xsnap-pub).
  • When built, this package generates an xsnap-worker executable, at a well-known location within the package directory
  • The JS code in this library knows how to spawn a child process with this executable, and how to exchange messages over the netstring pipe to control it

The second is a package named @agoric/xsnap-lockdown. This provides the lockdown bundle. In particular, it contains ses-boot.js, which imports Endo, and configures the console (to accommodate the XS print function that provides the sole access to stdout). The lockdown package depends upon some particular version of Endo, by virtue of a package.json dependency on the @endo/init package.

The lockdown package exports two API functions. The first is getBundleURL(), which returns a URL that points to a pre-generated bundle file (maybe we make it async for future flexibility). The second is getVersions(), which returns an object with at least { version, bundleSHA } strings. The bundle would be generated by calling bundleSource() on an entry-point file that lives inside the lockdown package, at npm build time, so the bundle would be included in the tarball that gets uploaded to NPM. At the same npm build time, we hash the bundle and record the hash in a generated file that is consulted by getVersions().

Clients should, at the very least, record the expected bundleSHA in their own source code and compare it against the retrieved value at each application startup. More cautious clients should also independently hash the file at getBundleURL() and assert that it matches their expectation. These checks prevent compatibility/determinism surprises in case something in the lockdown package (or further downstream) gets updated somehow, like if yarn computes a different dependency solution. When a client decides to switch to a new version of the lockdown package, they should update their expectations accordingly.
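A sketch of that client-side check (the EXPECTED_SHA constant is a placeholder the client pins in its own source; the two imports are the proposed exports described above, not yet a published API):

```js
import assert from 'assert';
import fs from 'fs';
import { createHash } from 'crypto';
// Proposed exports, as described above.
import { getBundleURL, getVersions } from '@agoric/xsnap-lockdown';

const EXPECTED_SHA = 'abc123...'; // placeholder: pinned by the client

// Cheap check at startup: compare the recorded hash.
const { bundleSHA } = getVersions();
assert.equal(bundleSHA, EXPECTED_SHA, 'unexpected lockdown bundle version');

// More cautious: independently hash the bundle file itself.
const bytes = fs.readFileSync(new URL(getBundleURL()));
const actual = createHash('sha256').update(bytes).digest('hex');
assert.equal(actual, EXPECTED_SHA, 'lockdown bundle contents diverged');
```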

Alternatively, we might choose to have getBundleURL() perform the bundling process in that moment, writing the contents to some cache directory; the hash should remain the same either way. Using this approach would make it more obvious that the @agoric/xsnap-lockdown version also depends upon the @endo/bundle-source version used to make the bundle. However, it is common for a newer bundle-source to still produce the same bundles, whereas it is exceedingly unlikely that bundling a newer @endo/init package could do the same.

The third package is the "supervisor package", which will live in @agoric/swingset-xs-supervisor at some particular version. This uses the same build/bundling process as the lockdown bundle, but its entrypoint is a supervisor.js, which has considerably more code, and is quite swingset-specific. This supervisor will import code from the @agoric/swingset-liveslots package (#6863).

The @agoric/swingset-worker-xs-v1 package exports a makeStartXSnap() function. This creates a new xsnap worker and prepares it for vat deliveries. If the options include a heap snapshot to start from, the new worker is loaded from that snapshot. If not, the new worker is empty, and makeStartXSnap() instructs it to evaluate the lockdown and supervisor bundles.
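In rough pseudocode (the `xsnap()` spawn helper and its option names are illustrative, not the exact `@agoric/xsnap` API):

```js
// Sketch of makeStartXSnap()'s two startup paths; xsnap() is assumed to
// spawn the worker process, per the xsnap package description above.
async function startXSnap({ snapshotStream, lockdownBundle, supervisorBundle, handleCommand }) {
  if (snapshotStream) {
    // Resuming from a heap snapshot: the lockdown and supervisor code
    // are already baked into the heap, so they must not be re-evaluated.
    return xsnap({ snapshot: snapshotStream, handleCommand });
  }
  // Cold start: build the environment from scratch, in order.
  const worker = xsnap({ handleCommand });
  await worker.evaluate(lockdownBundle); // SES lockdown first
  await worker.evaluate(supervisorBundle); // then supervisor + liveslots
  return worker;
}
```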

XS Compatibility Levels

When considering moving from one version of XS to another, we must think about how similarly the new version behaves, compared to state or actions created under the previous version. There are two things for the new behavior to be compatible with:

  • "new delivery consistency": replicas in a consensus machine (i.e. validators in a blockchain) must each execute the transactions in new blocks in the same way: validator A and validator B receive the same block, with the same transactions, and they must produce the same output state
  • "replay consistency": when an individual machine is restarted, or one of its vat workers is brought back online for whatever reason, and we replay the vat's recent transcript to bring the worker back up to speed, the behavior during the replay must match the original behavior as recorded in the transcript

A failure of new-delivery consistency would cause a validator to fall out of consensus. A failure of replay consistency might be caught by the individual validator as it noticed a discrepancy between the transcript-recorded syscalls (or delivery results) and the replay.

Now we can classify potential changes in XS behavior into 4-ish categories:

  • "level A": XS behaves the same way: syscalls, GC timing, metering, allocation are all the same, for any conceivable input
    • this doesn't leave a lot of room for XS to change, however pure performance improvements would qualify
    • it could also include fixes for potential crashes that have not yet been experienced on-chain
      • i.e. any XS crash would kill the xsnap worker, which would panic the kernel, which means the resulting state doesn't get recorded, so a new version is allowed to diverge by surviving the same operation
      • we could call this "anthropic determinism": if anybody is around to observe the behavior, it must be deterministic, but if all the observers died, then divergence can be tolerated
    • it cannot contain fixes for crashes that would not have panicked the old kernel, because we can't guarantee that such a crash didn't happen in some deployment
    • a validator should be able to restart from scratch, or from any point in recorded history, with any A-compatible worker version, without experiencing divergent behavior
      • it could even downgrade to an older A-compatible version
  • "level B": XS behaves the same way for all activity seen on the chain so far
    • this is dependent upon a particular deployment, and what sorts of messages it has observed
      • implies a conformance test which replays the chain corpus and compares the outputs of the new worker against the transcripts/slogfiles
      • the conformance test might look at all historical activity: start from the beginning of the chain
        • foregoing this would limit forms of upgrade/downgrade/restart-from-scratch options
          • this is the problem we had with the pismoB release, as it was not suitable to start from scratch (agoric-upgrade-8)
      • more likely, the conformance test might only look at deliveries since the most recent heap snapshot
    • this level allows slightly more significant changes to behavior
    • we could only safely deploy this:
      • if the conformance test passes
      • and if we're lucky enough to not have divergence during the gap from the end of the tested corpus to the point of upgrade
      • and we deploy it to all validators at the same time: no per-validator upgrade/downgrade choices
    • vats would experience their transcripts (recorded with the old version) get replayed with the new version
    • new deliveries would all use the new version
  • "level C": the new XS can read the old snapshot, but its new syscall/GC/allocation behavior may be different, even in historical deliveries
    • "C-gc": syscall behavior is the same, but GC/allocation may differ
      • if we can really get liveslots to be insensitive to organic GC, and disable metering around anything sensitive to GC
      • then we might be able to use this somehow
    • "C-syscall": even syscall behavior diverges
    • it would only be safe to switch to a C-compatible worker version just after a heap snapshot point, when there are no transcripts to be replayed, so we don't have to worry about replay-consistency
      • e.g. we might keep using the old version until the next BOYD+heap-snapshot point, then switch over
      • both versions must be available to the kernel until that point, however we don't need a trauma-inducing baggage-style upgrade
      • we might accelerate the BOYD/heap-snapshot point if we want to get a new XS deployed soon
  • "level D": the new XS can't even read the old snapshots
    • switching to a D-compatible worker could only be done with a baggage-style vat upgrade

Changing XS

The compatibility level of XS determines how we might deploy it to the chain.

A-level compatibility

Since A-level changes to XS do not cause observable behavior differences, we should be able to deploy them at any time. We would publish a new version of @agoric/xsnap with a micro version number bump (e.g. 1.0.0 to 1.0.1), which points to a different git-submodule commit hash of agoric-labs/moddable (for XS) or agoric-labs/xsnap-pub (for xsnap itself). Then we'd publish an @agoric/swingset-worker-xs-v1 package with a micro version bump, and we'd change @agoric/swingset-kernel to reference the new worker package. @agoric/swingset-worker-xs-v1 would continue to reference the same lockdown and supervisor packages as before.

Validators could switch to the new kernel at their leisure, without requiring a coordinated chain upgrade event.

B-level compatibility

If the XS changes are classified as "B-level", then the upgrade process requires conformance testing (and a little bit of luck, given that it might take a week from the time testing must stop to the time a chain-upgrade vote can complete). We'd produce the same packages as above, but we'd bump the minor version (e.g. 1.0.0 to 1.1.0), to indicate the lower level of compatibility. All validators would need to move to the new @agoric/swingset-kernel version at the same time.

(note: maybe the minor version isn't the right component to change; we might need four levels of version number to capture all of this).

C-level compatibility

For C-level changes, the kernel will need to run both old and new versions for some period of time (since not all vats will reach their BOYD+heap-snapshot point at the same moment, and that moment certainly doesn't happen at the moment of chain upgrade). Therefore we must allocate a new @agoric/swingset-worker-xs-vNNN number (e.g. adding a package named @agoric/swingset-worker-xs-v2). This new package would depend upon a new xsnap version, which can have a major(??) version bump.

The new kernel version would depend upon both -v1 and -v2. The kernel would know (somehow? maybe the getVersions() return values??) that this upgrade must be performed at a heap-snapshot boundary, and would schedule a change from -v1 to -v2 at that point.

D-level compatibility

For D-level changes, we definitely need separate -v1 and -v2 packages. The kernel would wait until the next baggage-style vat-upgrade to swap in the new version. Only a change at this level is allowed to change the lockdown (SES/Endo) or liveslots bundles.

It seems likely that the kernel might not automatically upgrade vats across this boundary. Instead, the kernel upgrade changes the default "what worker should new vats use?" value, which can also be the value used when upgrading an existing vat. Then, userspace (e.g. contract governance committee action) causes a vat upgrade for each vat, and the bundles change during that upgrade.

Changing Endo/SES, or Liveslots

We create a new version of @agoric/swingset-liveslots or Endo, then we make a new version of @agoric/swingset-xs-supervisor or @agoric/xsnap-lockdown. Then we make a new package like @agoric/swingset-worker-xs-v2, just like in the "D-level" change above. The kernel depends upon both -v1 and -v2, and a userspace-triggered upgrade takes advantage of the new versions.

@warner
Copy link
Member Author

warner commented Jan 30, 2023

So here's a plan of attack:

  • delete the unused workers (subprocess-node, and the thread-based nodeWorker), since they're incomplete, under-tested, and less code means less migration work
  • land "move liveslots out to a separate package" (#6863) to move liveslots to its own package
  • move makeStartXSnap from SwingSet/src/controller/controller.js to SwingSet/src/make-xs-worker/
  • move supervisors/subprocess-xsnap there too
  • make a copy of supervisor-helper.js in there too
    • now src/make-xs-worker/ is basically everything needed for our new package
  • move src/make-xs-worker/ into a new packages/swingset-xs-worker-v1/
    • it will depend upon swingset-liveslots and xsnap
  • create packages/swingset-xs-lockdown, move lockdown-subprocess-xsnap.js to it
    • give it a build script that makes a bundle file
    • export getBundleURL() and getVersions()
    • change swingset-xs-worker-v1 to read the bundle during worker init, and compare hashes
  • create packages/swingset-xs-supervisor, move supervisor-subprocess-xsnap.js and supervisor-helper.js to it
    • same build script, bundle/versions/hash API
  • move all the swingset packages (except maybe the kernel) out to a new repository, with a different versioning policy

@mhofman
Member

mhofman commented Jan 30, 2023

> • make a copy of supervisor-helper.js in there too

It would be great if during this process every file ended up as a git move so we can keep history somewhat traceable.

@warner
Member Author

warner commented Jan 30, 2023

> > • make a copy of supervisor-helper.js in there too
>
> It would be great if during this process every file ended up as a git move so we can keep history somewhat traceable.

Agreed, although I don't know how to make copies trigger git's heuristic. OTOH, if you're following it with git log --follow, you'll probably see it "created from nothing" at the commit that actually copied it, and I'll make sure the comment on that commit cites the place it came from, so at least the human running git log will know what second command to run to follow it further back.

warner added a commit that referenced this issue Feb 25, 2023
This extracts the XS supervisor bundle out of packages/SwingSet and
into a new package whose NPM name is `@agoric/swingset-xs-supervisor`,
and lives (for now) in packages/swingset-xs-supervisor .

The bundle is created explicitly, by running `yarn build`. As with
`xsnap-lockdown`, the intention is that each published version has a
stable bundle, with specific (hashable) contents. Importing clients
can rely upon this stability to e.g. have deterministic xsnap heap
snapshot contents even if the versions of other parts of the system
have changed.

The new README.md describes the API, the stability goals, and how we
achieve them.

Fairly soon, we'll need to manage the version number of this package more
explicitly, either by moving it to a different git repository, or
changing our use of `lerna publish` to avoid spurious changes. But for
now we can leave it in place.

refs #6596
warner added a commit that referenced this issue Mar 1, 2023
This extracts the XS supervisor bundle out of packages/SwingSet and
into a new package whose NPM name is
`@agoric/swingset-xsnap-supervisor`, and lives (for now) in
packages/swingset-xsnap-supervisor .

The bundle is created explicitly, by running `yarn build`. As with
`xsnap-lockdown`, the intention is that each published version has a
stable bundle, with specific (hashable) contents. Importing clients
can rely upon this stability to e.g. have deterministic xsnap heap
snapshot contents even if the versions of other parts of the system
have changed.

The new README.md describes the API, the stability goals, and how we
achieve them.

Fairly soon, we'll need to manage the version number of this package more
explicitly, either by moving it to a different git repository, or
changing our use of `lerna publish` to avoid spurious changes. But for
now we can leave it in place.

refs #6596
warner added a commit that referenced this issue Mar 8, 2023
This moves the xsnap-worker-creation code out of SwingSet and into a
new package, whose NPM name is `@agoric/swingset-worker-xsnap-v1`, and
lives (for now) in packages/swingset-worker-xsnap-v1 .

This new package encapsulates:
* the choice of `xsnap` executable, hence the XS engine version inside
* the choice of `@agoric/xsnap-lockdown`, hence the Endo/SES version
* the choice of `@agoric/swingset-xsnap-supervisor`, hence liveslots

We roughly expect there will be exactly one version of
`@agoric/swingset-worker-xsnap-v1` published. Any major changes to
XS/xsnap/SES/liveslots will be published to a new
`@agoric/swingset-worker-xsnap-v2` package, and a single kernel can
depend upon both -v1 and -v2 at the same time, to support older
vats (started with -v1) as well as newer vats (started with, or
upgraded to, -v2).

Minor (and sufficiently backwards-compatible) changes of the
components may be possible. If so, we can publish new versions of -v1,
and deployed systems can perform orchestrated (within-consensus)
upgrades from e.g. -v1@1.0 to -v1@1.1 . We still need to think
carefully about potential differences in behavior between the
first-run deliveries (using 1.0) and the post-upgrade replayed
deliveries (using 1.1).

SwingSet now has a primary dependency on the new
`@agoric/swingset-worker-xsnap-v1` package, and only `devDependencies`
on the individual components (`@agoric/swingset-xsnap-supervisor` and
`@agoric/xsnap-lockdown`). Hopefully these dev-dependencies will go
away once the dust settles.

refs #6596
@warner
Member Author

warner commented Mar 17, 2023

I'm thinking about how the repos and packages and workspaces ought to work, and how a "test against current versions of everything" (#7056) should work.

Tools at our disposal:

  • yarn (v1) "workspaces" get you multiple cross-referencing packages in a single repo:
    • the top-level yarn.lock is used, ignoring any per-package yarn.lock
    • the top-level node_modules/ will contain symlinks to local package directories
      • if the (package.json).version of the local package satisfies the constraint
      • else e.g. packages/depender/node_modules/dependency@otherversion will be populated, from NPM
  • yarn (v2) workspaces have more features, which might make this easier if we could switch to v2 (eventually move to Yarn 2+ (currently 4) #451)
    • nested worktrees, so e.g. swingset could be a subtree, with the various packages inside it
    • rewriting workspace: dependencies into the just-published versions at packaging/publish time
  • the package.json dependencies section can contain aliases, basically "ALIAS": "PACKAGE@VERSION"
    • so e.g. the kernel could say "worker-xsnap-v1": "@agoric/swingset-worker-xsnap@1.0.0" and "worker-xsnap-v2": "@agoric/swingset-worker-xsnap@2.0.0" at the same time (see the package.json sketch after this list), and maybe we could avoid the need for distinctly named worker packages
  • we could use yarn link to create a symlink from node_modules/XYZ (in one git checkout, e.g. agoric-sdk) to a nearby but separate git checkout (e.g. swingset-liveslots)
    • folks working on kernel code might use this while developing a new version of liveslots/etc
  • git submodule lets you place one entire git repo in a subdirectory of another
    • our packages/xsnap/ subdirectory currently uses this to grab both the Moddable XS library code, and the xsnap executable's C code
    • a .gitmodules file is added, which tells git which subdirectories should be populated, and which repository URLs to use
    • a special piece of git metadata, committed just like regular files or directories, is used to record the exact commit ID (hash) that should be used; note that this is not a branch name
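For the alias idea mentioned in the list above, the kernel's package.json might look like this (a sketch; note that npm and recent yarn spell aliases with the `npm:` protocol):

```json
{
  "dependencies": {
    "worker-xsnap-v1": "npm:@agoric/swingset-worker-xsnap@1.0.0",
    "worker-xsnap-v2": "npm:@agoric/swingset-worker-xsnap@2.0.0"
  }
}
```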

The new package dependency graph is:

```mermaid
graph TD;
  swingset-kernel --> swingset-worker-xsnap-v1;
  swingset-worker-xsnap-v1 --> xsnap;
  swingset-worker-xsnap-v1 --> xsnap-lockdown;
  xsnap-lockdown --> endo/init;
  swingset-worker-xsnap-v1 --> swingset-xsnap-supervisor;
  swingset-xsnap-supervisor --> swingset-liveslots;
```

and the goal is to allow swingset-kernel to depend on multiple versions of swingset-worker-xsnap-vNN at the same time, each of which will need very specific versions of its dependencies.

"thorough but painful" approach

One potential approach, with obvious drawbacks:

  • make new single-package repositories for all of:
    • swingset-liveslots
    • swingset-xsnap-supervisor
    • xsnap-lockdown
    • xsnap (which uses git submodules of moddable/xs and xsnap-pub)
    • swingset-worker-xsnap-v1
  • each package has precisely-versioned dependencies, and a yarn.lock file
  • create releases (bundles) of swingset-xsnap-supervisor and xsnap-lockdown from standalone checkouts
    • the yarn.lock will be honored because there is no surrounding workspace to override it
  • after making a change to any package, publish it to NPM, with a new version number
    • then climb the graph to the importing package, update its dependency to match the new child, publish it to NPM, with a new version number
    • repeat until swingset-worker-xsnap has a newly published version, and then decide whether the kernel will change from e.g. 1.0 to 1.1 (making the claim that existing vats do not need to be upgraded), or if it will add a new worker type like -v2 (leaving existing vats alone until the application chooses to upgrade them)

This would achieve our stability goals, but development would be horrible: every commit would have to go through NPM before the parent package could use it. We'd be publishing broken versions because we couldn't actually test the combination of new-child + new-parent before committing to the new-child patch.

"special kernel development workspace" approach

We might create a separate repo for use by kernel developers. It would define a yarn-v1 workspace at the top, and use git submodules to link in the five child packages so they'd all sit next to each other inside the workspace. If we made the cross-package dependency version constraints more lenient, then yarn install would satisfy them with symlinks. That would let us make matching changes in both child and parent package without going through NPM for each commit.

One problem is that git submodule leaves each embedded git repo checked out on a "detached HEAD": it records the commit ID (hash), but not a branch name. So when a developer commits a change in one repo, it isn't associated with any branch (like master or a feature branch to submit in a PR). We'd have to go through some ugly hoops: after the initial git submodule update clones everything and switches to the recorded commit ID, we'd need a script to change the submodules to a named branch. We'd also need it to change the git remote specification to allow developers to push to GitHub from the submodule (normally the .gitmodules file refers to a read-only HTTPS URL).

The non-kernel agoric-sdk code would continue to use a kernel from NPM, making a boundary between swingset and its clients that would have to go through NPM for each change (same problem as above, but at a different boundary, hopefully one which changes more slowly). Devs who want to test both sides could use yarn link to associate the two repos.

"yarn link all the things"

Another approach would be to use yarn link to associate multiple git repos within a single developer's working environment:

  • git clone agoric-sdk
  • next to it, git clone swingset-liveslots, repeat for the other swingset repos
  • starting from the bottom of the dependency graph, in each repo:
    • run yarn link, to register this repo as providing the package
      • (running yarn link, with no arguments, inside package X, basically creates a registry in ~/.config/yarn/link/ that says "if anything else runs yarn link X, make their node_modules/X be a symlink to this repo")
    • run yarn link PACKAGENAME for all of the child packages which this one depends upon
  • then, in agoric-sdk, run yarn link @agoric/swingset-vat to allow sdk packages to use the kernel from the nearby repo

Depending on the versions and constraints, this would replace e.g. swingset-xsnap-supervisor's node_modules/@agoric/swingset-liveslots/ subdirectory (as installed from NPM) with a symlink to the nearby repo. We could then make changes in liveslots, cd to swingset-xsnap-supervisor, re-run yarn build to get a new bundle, etc.

When it comes time to publish a version to NPM, we'd use yarn unlink PACKAGENAME to remove all the symlinks, then do npm publish of each package, from the bottom of the graph and upwards.

"one repo but mess with workspaces a lot"

In this approach, we continue to use a single monorepo, but we use a release process that temporarily changes the definition of the workspace.

  • move all swingset packages into a subdirectory: packages/swingset/swingset-kernel, packages/swingset/swingset-worker-xsnap, etc
  • change the top-level (package.json).workspaces glob list to include packages/swingset/*
  • change the swingset packages' cross dependency constraints to * ("accept any version")
  • change the swingset packages' versions to some 1.N.dev0 suffix, which sorts larger than 1.N
    • we want a dependency on 1.N to not be satisfied by the local 1.N.dev0
  • in the kernel package, use an alias dependency worker-xsnap-dev -> @agoric/swingset-worker-xsnap@*
    • this one will be satisfied by the workspace, with a symlink
    • the stable versions will use an alias dependency like worker-xsnap-v1 -> @agoric/swingset-worker-xsnap@1.0
      • these will be satisfied by an NPM install, with its pre-published (stable) bundle

Development can continue in the tree, and running tests with WORKERTYPE=dev (or however we spell the environment variable that tells the kernel what managerType/workerType to use by default) will get the local versions (assuming you remembered to yarn build to update the bundles). Running tests without the environment variable will get worker-xsnap-v1 (or whatever the latest published version is).

When it comes time to release the swingset packages, the sequence will look like:

  • create and checkout a temporary swingset release branch
  • edit the top-level package.json to remove packages/swingset/* from the workspaces: key
  • commit the package.json change
  • delete all node_modules directories, so they don't interfere
  • iterate through all the swingset packages, bottom-up
    • inside that package:
    • remove the yarn.lock file
    • yarn install to regenerate it, and to populate node_modules (from NPM, not the workspace, since we're no longer in a workspace)
    • look at the updated yarn.lock to find the versions of the other swingset packages that it found
    • copy those version constraints into package.json
    • yarn install again, nothing should change
    • commit the new yarn.lock contents
    • run yarn build, to generate the bundles and their hashes
    • bump the package.json version from 1.N.dev0 to 1.(N+1)
    • commit the package.json changes
    • git tag
    • run yarn publish, publish the generated+stable bundle to NPM
    • bump the package.json version to 1.(N+1).dev0
    • replace the other swingset package dependency versions with *
    • commit package.json
    • now move up the graph
  • after finishing all the swingset package publishes:
  • revert the change to the top-level package.json (bringing the swingset packages back into the workspace)
  • commit the top-level package.json change
    • (remember we're still on the temporary release branch)
  • clobber packages/swingset/*/node_modules
  • "yarn install" to repopulate node_modules
  • that will probably change the top-level yarn.lock (with new swingset-package versions)
    • so commit that yarn.lock change
  • now merge the temporary swingset release branch back into trunk

Note that all of the swingset package release tags will point to commits on the side branch, and each will point to a distinct commit.

We must also change the agoric-sdk release/publish process:

  • it must not attempt to publish the swingset packages, or modify their versions
  • probably start by editing the top-level package.json (to remove the swingset packages from workspace:)

The swingset packages will spend most of their time at .dev0 versions. In particular, the swingset-worker-xsnap package will be at .dev0, which means the kernel package's worker-xsnap-v1 alias dependency will not be satisfied by the version in the workspace, and can only be satisfied by the NPM-published version (with its stable hashed pre-bundled contents).

"something using a newer Yarn"

We're currently using yarn-v1, and support for workspaces has improved significantly in the later versions (currently yarn-v3.5.0). One improvement is the ability to use nested workspaces, called "worktrees". We could use this to nest a set of swingset packages in one workspace, under the general agoric-sdk workspace.

Another improvement is a dependency constraint syntax that says "only use something from my same workspace", spelled like workspace:*. This would let e.g. swingset-worker-xsnap depend on "@agoric/swingset-xsnap-supervisor": "workspace:*", so local development would strictly use code from within the tree. In addition, when publishing packages, yarn can dynamically replace those constraints with the versions being exported by the dependencies. This would remove the need to manually change package.json files back and forth during the release process.
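A sketch of what that constraint would look like in swingset-worker-xsnap's package.json (yarn rewrites `workspace:*` to the concrete version at publish time):

```json
{
  "dependencies": {
    "@agoric/swingset-xsnap-supervisor": "workspace:*"
  }
}
```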

which one?

I'm still trying to figure out a good plan here.

@warner
Member Author

warner commented Mar 17, 2023

Oh, before I forget, any change which creates new repos or workspaces will need to consider the patches/ directory, which (in conjunction with patch-package) we use to modify some of our dependency packages before use. We patch a lot of Endo packages, and those will be incorporated into the xsnap-lockdown bundle.

@FUDCo
Contributor

FUDCo commented Mar 18, 2023

I haven't completely absorbed this, but here are a couple of somewhat scattered thoughts, which may not reflect the strongest possible understanding of the yarn/npm/git/blahblah toolchain:

  • I'd prefer whatever we do to avoid git submodules. In my experience they cause nothing but grief.

  • If we are clear on major vs. minor versioning, we could have V1, V2, Vn (i.e., major version) worktrees, which are maintained in a relatively stable state (i.e., no significant development goes on there, but bug fixes could change them). We also have a Vdevel worktree, which is where we do most of the dev work and where we spend most of our time. Day-to-day development activity would work more or less the way it does now: in the committed master branch, Vdevel code is expected to be kept in a non-broken state (exactly as we do now), but even though the code is functional and as correct as possible at the time, it's not published to npm or released to the world (aside from being in a git repo where somebody can grab it if that's what floats their boat).

    When we do a minor version release (presumably updating the most recent major version), we have a process that merges changes into the Vwhatever worktree, generates packages, pushes to npm, etc. This process should be automated, but might require a little hand coaxing from time to time, much as git branch merges do now. When we do a major version release, we create a new Vnewest worktree, copy the Vdevel stuff into it wholesale, then proceed as we do with a minor release. This process might require a fair bit of manual activity, but presumably would be rare enough that investing in significant automation just for this is probably not a win. All this is essentially replicating the familiar git work-branch vs. master-branch pattern, but at one higher meta level.

@mhofman
Member

mhofman commented Mar 20, 2023

This is the first time I'm actually thinking about this problem, and I may be missing something, but I don't understand the direction taken right now, or the process taken to get there. In particular, a single "worker-v1" package amalgamating xsnap, endo, liveslots, and various dependencies.

I'm having an easier time thinking of this problem in terms of API surface and upgradable units, starting from the program itself. I also want to think about what the right approach conceptually should be, even if it may not be feasible right now without other work throughout our stack.

Code running in a worker

Vat program

If we look at a JavaScript program executing in a vat, it has dependencies on the environment created by liveslots and endo, and possibly a dependency on a certain set of JS features. The program may also have dependencies on remote objects or devices, which are not part of the environment, but may be part of the initial execution of the program (currently defined as buildRootObject for a liveslots program). What it doesn't have is a direct dependency on xsnap, the kernel itself, or cosmic-swingset.

While these are dependencies in the conceptual sense, from the perspective of the program's package.json, all of these are part of the execution environment, and not part of the package's "dependencies". There are two other package.json features that are a better fit for expressing these dependencies: "peerDependencies" and "engines". "peerDependencies" is used for packages that are relied on by the program (possibly imported), but which are expected to be installed / provided by the surrounding environment. "engines" roughly corresponds to the set of features available ambiently (either global, but also "built-in" packages).

One approach would be to have an "endo" engine version which describes the set of hardened JS features available in the locked-down execution environment (as a combination of JS features implemented by the XS engine, and the modifications brought by @endo/init). To express how the program is executed, we'd need a second "engines" entry indicating that this program is meant to be invoked through "liveslots"'s buildRootObject. These "engines" dependencies would not result in code being included in the bundle of the program. However, we'd need these "engines dependencies" to be documented by the bundle so they can at least be verified for compatibility by the kernel. I believe this information is currently lost when the bundle is created.
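As a sketch, a vat program's package.json might declare its execution environment like this (the "endo" and "liveslots" engine names are hypothetical, per the proposal above):

```json
{
  "name": "my-vat-program",
  "engines": {
    "endo": "^2.0.0",
    "liveslots": "^1.0.0"
  }
}
```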

Zoe contracts

But looking at liveslots programs is only part of the picture. Most of the programs running in a vat are actually Zoe contracts, which rely on a specific version of the Zoe Contract Facet to invoke them (through an exported start or prepare) with a ZCF object of the right shape passed as argument. In that sense, the "engines" field of these programs should actually be describing ZCF, not liveslots. However, Zoe contracts still likely rely on features provided by "liveslots", and an "endo"-flavored hardened JS execution environment. This raises the question of how these "external" dependencies should be expressed, and which part of the system is responsible for enforcing them.

Let's take an example. Let's say ZCF itself relies on the basic "liveslots@^1.0" feature set and documents it as such in its "engines". However, we introduce a new liveslots feature in a backwards-compatible way (e.g. a new global), and bump the liveslots version to 1.1.0. The kernel/liveslots deals with executing ZCF, and could set up or verify the version of liveslots compatible with ^1.0. However, the kernel/liveslots doesn't know anything about the contract or its requirements. ZCF would need to be the one responsible for checking that the current version of liveslots is compatible with the contract's requirements; however, ZCF itself cannot set up the environment, as it's already executing in it. Only the vat creator has that opportunity.

There is also the question of how the Zoe contract expresses its reliance on liveslots@^1.1.0. Since it's not executed directly by liveslots, the "engines" field may not be appropriate anymore, and "peerDependencies" may be a better fit. However it seems that endo's bundler may currently attempt to bundle peerDependencies the same as regular dependencies (see endojs/endo#1355).

Supervisor bundle

On the other side, we have the supervisor bundle. This currently evaluates as a script, but it may not stay like that (for example it could transition to an endo worker which I believe expects a main export). It relies on some JS features and a hardened JS environment with Compartment for example, as well as some globals to interact with the runtime and kernel (e.g. issueCommand and metering for xsnap workers). It also contains and sets up "liveslots" which will be used by the program.

For an xsnap supervisor, an "xsnap" entry for the "engines" would make sense to document the ambient APIs required. Similarly, if we introduce sqlite directly into the worker process, having a way to document this external dependency would be needed. However, what may not make sense as an "engines" requirement is the protocol version that liveslots speaks with the kernel over issueCommand/handleCommand, as this is something that fits the definition of "peerDependencies" much more closely. Regardless, this "protocol dependency" is a requirement that liveslots, and thus the supervisor bundle, would need to express, and it needs to be fulfilled by the kernel when creating the vat. At some point we'll likely need the kernel to support multiple versions of this liveslots protocol. However, this protocol version may be incremented independently of the liveslots version itself as required by the vat.

For example, we may move vatstore inside the worker, which would likely be a breaking change in the liveslots protocol between the kernel and the supervisor, as well as possibly in the protocol between the kernel and the worker process. However, this may be transparent to the vat program itself, and only necessitate a minor version update of "liveslots" from its point of view.

Worker

This is the final and lowest level component in a vat, the one providing a hardened JS execution environment for the supervisor bundle, as well as the low level transport with the kernel (some of which is exposed to the supervisor, like the commands interface, and some which is only exposed to the kernel, like snapshots). As such it may provide 2 "engines" versions to the supervisor (e.g. "endo" + a worker specific one like "xsnap"), but it also needs some independent versioning with the kernel.

For example, a breaking change in an xsnap worker would be to change the snapshot format, but that would have no bearing on the supervisor or vat bundles executing in the vat (beyond the fact that changing the xsnap worker version requires going through the "upgrade" process for the vat program since we cannot restore the program's heap).

On the other side, a change to the metering API may be a breaking change between the xsnap worker and the supervisor, but may not impact the transport between the xsnap-worker process and the kernel.

Version requirements and checks

All this suggests that when creating or upgrading a vat, the "endo" and "liveslots" versions should be specified explicitly or derived from the bundle. However, the xsnap or liveslots protocol versions are in fact transitive dependencies that should not be expressed or checked at that layer. The major version of the worker would not be part of the options either: a null upgrade would implicitly use the latest major version of the worker compatible with the required "endo" and "liveslots" versions, and it then becomes a question of who triggers the null upgrade.

Swingset would need to handle multiple supervisor bundles and worker major versions. However, these two don't have to be merged together (though they can be, to simplify things). More importantly, their versions are not directly equivalent to the "endo" and "liveslots" versions that the vat bundle requires.

This points to the need for the following mechanisms (a code sketch follows the list):

  • Helper to extract what direct external dependencies a package/bundle requires. This can help verify or provide the needed environment for a bundle.
  • Helper to extract what "dependencies" a package/bundle provides. This can help decide which version to use that is compatible with the requirements of a nested component.
  • A way to discover which dependencies are currently present in the environment. This allows a component to verify that the dependencies of its nested components are compatible.
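A minimal sketch of what those three helpers could look like, using the semver package for range matching; none of these names exist in agoric-sdk today:

```js
import semver from 'semver';

// 1. What direct external dependencies a package/bundle requires,
//    e.g. read from its embedded metadata.
const extractRequirements = bundle => bundle.metadata?.requires ?? {};

// 2. What "dependencies" a component provides to nested components.
const extractProvides = pkg => pkg.provides ?? {};

// 3. Verify that the discovered environment satisfies a set of requirements.
const checkCompatible = (provides, requires) =>
  Object.entries(requires).every(
    ([name, range]) =>
      provides[name] !== undefined && semver.satisfies(provides[name], range),
  );
```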

For example, we could have the following layering (a concrete compatibility check over it is sketched after the list):

  • xsnap-worker@1.1.3: requires xsnap-worker-protocol@^1.2.0, provides endo@2.0.0, metering@1.0.2, commands-api@1.3.4
  • xsnap-worker@1.1.4: requires xsnap-worker-protocol@^1.3.0 (supports streaming snapshot), provides endo@2.0.0, metering@1.0.2, commands-api@1.3.4
  • xsnap-worker@2.0.0: requires xsnap-worker-protocol@^1.3.0, provides endo@3.0.0 (native Compartments), metering@1.0.2, commands-api@1.3.4
  • xsnap-supervisor@1.4.4: requires endo@2, metering@^1.0.0, commands-api@^1.0.0, liveslots-protocol@~1.2.0, provides liveslots-api@1.1.6
  • xsnap-supervisor@1.4.5: requires endo@2 || endo@3 (checked compatibility), metering@^1.0.0, commands-api@^1.0.0, liveslots-protocol@~1.2.0, provides liveslots-api@1.1.6
  • zcf-bundle@5.2.1: requires endo@>=2, liveslots-api@^1.0.0, provides zcf-api@5.2.0
  • contract bundle: requires endo@>=1, liveslots-api@^1.1.2, zcf-api@^5.0.0
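Applying the hypothetical checkCompatible helper from the earlier sketch to this layering:

```js
// The contract bundle's requirements, per the layering above.
const requires = { endo: '>=1', 'liveslots-api': '^1.1.2' };
// What xsnap-worker@1.1.3 plus xsnap-supervisor@1.4.4 jointly provide.
const provides = { endo: '2.0.0', 'liveslots-api': '1.1.6' };
checkCompatible(provides, requires); // true
```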

At first, only version 1.1.3 of xsnap-worker exists.

  • Someone uses the zoe API to instantiate the contract bundle.
  • The Zoe service extracts the requirements from the contract bundle, selects zcf-bundle@5.2.1 and requests the creation of a vat with it, with requirements endo@>=1 and liveslots-api@^1.1.2.
  • Vat admin extracts the requirements from the zcf bundle, and requests a vat from the kernel with requirements endo@>=2 and liveslots-api@^1.1.2.
  • The kernel extracts the provides from xsnap-worker@1.1.3 and xsnap-supervisor@1.4.4, verifies that they match the vat requirements (including that the kernel is implementing the correct liveslots-protocol), creates the new vat, and executes the supervisor and then the zcf bundle.
  • The Zoe service sends a request to ZCF to load the contract bundle.
  • ZCF extracts the requirements from the contract bundle, verifies they match the discovered versions available in the environment, and executes the contract bundle.

In a chain upgrade, version xsnap-worker@1.1.4 is introduced. The version is fully compatible and is automatically used to restart existing workers.

In another chain upgrade, xsnap-worker@2.0.0 is introduced, which switches to native compartments. Given the change, this is a breaking change for both the snapshot content and the lockdown bundle. A new version of xsnap-supervisor is introduced which only bumps the endo dependency version, after verifying compatibility. A null upgrade of the vats is required. A new vat not compatible with the new endo version would still use the old worker, but new or upgraded vats which are compatible can switch.

  • A null upgrade is initiated for our existing contract vat (either explicitly, or through some policy, since vat admin is aware of the endo and liveslots requirements of the vat)
  • The kernel extracts the provides from xsnap-worker@2.0.0 and xsnap-supervisor@1.4.5, verifies that they match the vat requirements, and starts a vat null upgrade, which executes the supervisor and then the existing zcf bundle.
  • The ZCF zygote, as part of its upgrade process, extracts the requirements from the contract bundle (stored in baggage), verifies they match the discovered versions available in the environment, and executes the contract bundle.

Handling multiple versions

This requires the kernel to be aware of multiple versions of xsnap-worker / supervisor. Right now those cannot be introduced dynamically, so we can rely on source code changes in the kernel every time the xsnap-worker or supervisor changes.

One approach could be to always rely on NPM-installed versions for the non-latest packages, with some dependency aliasing so that the kernel can depend on multiple versions of these packages. This does make it complicated to do patch or minor updates to the non-latest version, and does not prevent the supervisor from changing due to repo effects, unless the supervisor is published as a bundle.

However, since we've disconnected the xsnap package and supervisor versions from the endo and liveslots versions they provide, I wonder whether we actually need publishable node packages for these at all.

On the xsnap side, we really have a binary and a JS spawner. The binary is mostly in an external repo already, and we could keep the above scheme, i.e. install a binary version published by an xsnap repo which just builds xsnap-pub with the right Moddable SDK. For "latest" work, we could keep a way to build the binary from a submodule instead. On the JS spawner side, we could have separate unpublished packages, or even more simply source files, corresponding to the various "xsnap-worker" versions, each internally referencing the required binary. These JS spawners could share some source code where applicable.

Similarly, for the supervisor, if we're careful not to have dependencies influenced by the rest of the repo, we could keep the different versions as sibling folders and files in the repo. To ensure we don't break our requirement of no changes to previous versions, we could either check in the resulting bundle, or a hash of the expected bundle, for each version.
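To make that "no changes to previous versions" check concrete, a build step could compare the freshly generated bundle against a checked-in hash, roughly like this (the file names here are hypothetical):

```js
import { createHash } from 'node:crypto';
import { readFileSync } from 'node:fs';

// Hash the bundle we just built and compare against the checked-in hash.
const bundleBytes = readFileSync('dist/supervisor-v1.bundle');
const actual = createHash('sha256').update(bundleBytes).digest('hex');
const expected = readFileSync('dist/supervisor-v1.sha256', 'utf8').trim();
if (actual !== expected) {
  throw Error(`supervisor-v1 bundle changed: ${actual} !== ${expected}`);
}
```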

@ivanlei ivanlei removed the MUST-HAVE label Mar 20, 2023
@warner

warner commented Mar 20, 2023

I had a good brainstorming session with @michaelfig and @mhofman this morning, and I think we have a better plan than before:

  • we'll land swingset-xsnap-supervisor (extract swingset-xsnap-supervisor out to a separate package #7063), but abandon swingset-worker-xsnap-v1 (create new swingset-worker-xsnap-v1 package #7101)
  • we'll leave everything in the monorepo, no changes to the release process
  • we'll enhance @FUDCo's new bundleStore to accept nestedEvaluate bundles too (with an ID that's probably the hash of the JSON encoding of the whole bundle object)
  • when the kernel is initialized, we'll sample the lockdown and supervisor bundles, store them in the bundleStore under their hash IDs, and add a pair of kvStore keys to remind the kernel of what the "current lockdown/supervisor bundle IDs" are (see the sketch after this list)
  • when creating a new vat, the kernel will copy these IDs into some per-vat metadata (in options)
  • when launching a vat worker, we'll use the metadata IDs to pull the lockdown/supervisor bundles from the bundleStore and feed them to the new worker, instead of using fixed copies
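A rough sketch of that init-time sampling step; the bundleStore/kvStore method names and the "b0-" ID prefix are illustrative, not the actual swing-store API:

```js
import { createHash } from 'node:crypto';

// ID is the hash of the JSON encoding of the whole bundle object, per above.
const bundleID = bundle =>
  `b0-${createHash('sha256').update(JSON.stringify(bundle)).digest('hex')}`;

const sampleBundles = (kvStore, bundleStore, { lockdown, supervisor }) => {
  for (const [which, bundle] of Object.entries({ lockdown, supervisor })) {
    const id = bundleID(bundle);
    bundleStore.addBundle(id, bundle); // store under the hash ID
    kvStore.set(`${which}BundleID`, id); // the "current bundle ID" reminder
  }
};
```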

That will get us stability of the two bundles for the lifetime of a vat incarnation. The stability is low-level (precise bundle contents, as opposed to a version string), which is good. We won't directly have a semver/human-meaningful name for that behavior: we'll talk about "the liveslots that shipped with agoric-sdk@1234" instead of "liveslots@456", but that's probably ok.

When we perform the agoric-sdk release process, we'll get a new semver version string for each bundle-creating package, and we'll create (and hash) a bundle as well. We won't learn the bundle's hash until that moment. The bundles ought to be reproducible: clone the SDK at that commit (which will include a particular yarn.lock), run yarn install to get all the same packages/versions into node_modules, and then yarn build should generate a bundle from all the same components.

Updating something in the agoric-sdk (e.g. with yarn add, modifying the yarn.lock) may or may not change what goes into the bundles. We believe that yarn's dependency resolver tries hard to retain the original versions: consider a package that lives in the transitive dependency set of both xsnap-lockdown and e.g. zoe, where zoe is changed to require a newer version (one that would still be accepted by the xsnap-lockdown constraint); we think yarn will include both versions rather than de-duplicate them (by increasing the version that xsnap-lockdown gets to one which meets both packages' constraints). However, we may want to deduplicate just to clean things up, and that certainly would change the files that go into the bundle. We should be able to inspect the bundle hash (printed to stdout during yarn build) to tell whether something changed.

TODO: make sure the generated bundles are included in the NPM package (run yarn pack and inspect the tarball), and change the package.json "files:" section if necessary.

After the vaults release, we'll need to decide how we get the second version of the bundles into place. One option is a controller.updateBundles() call (sketched after the list), which would:

  • query the current versions of the bundle-bearing packages (xsnap-lockdown and swingset-xsnap-supervisor) for their current hash
    • if that differs from kernelKeeper.getCurrentBundleIDs() (however it's spelled), then:
      • pull the bundle contents, store them in the bundleStore
      • replace the "current bundleIDs" kvStore entries with the new ones
  • vats created after this point will use the new bundles
  • change vat-upgrade to use the current bundle IDs too
    • so vats upgraded after this point will use the new bundles
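A sketch of what that call's core might look like; every method name here is illustrative, not an existing API:

```js
const updateBundles = (kernelKeeper, bundleStore, currentBundles) => {
  // currentBundles: e.g. { lockdown: { id, bundle }, supervisor: { id, bundle } },
  // queried from the bundle-bearing packages at their current versions.
  for (const [which, { id, bundle }] of Object.entries(currentBundles)) {
    if (kernelKeeper.getCurrentBundleID(which) !== id) {
      bundleStore.addBundle(id, bundle); // pull and store the new contents
      kernelKeeper.setCurrentBundleID(which, id); // replace the kvStore entry
    }
  }
};
```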

Other options/variants:

  • perform controller.updateBundles() on every kernel launch
    • this might change swingstore state, so the host app must be prepared for that (commit afterwards, etc)
  • provide vat-upgrade options to choose a specific version of the bundles
    • harder to make ergonomic without semver names for the hashes

Note: a top-level yarn build will build each package in alphabetical order. We don't have bundles that include other bundles, but if we ever did, we'd need to make sure the leaves got built before the higher nodes of the dependency/inclusion graph, so e.g. xsnap-lockdown and swingset-xsnap-supervisor would need to be built early.

xsnap

For now, we won't record any per-vat metadata to describe which version of xsnap (and thus which version of XS) is used. All workers will use the same xsnap, regardless of whether they were created weeks ago or just now.

This will constrain our ability to change XS/xsnap: any change we make must be sufficiently compatible to not cause problems when replaying a transcript (the old XS and the new XS must produce matching transcripts; this may be easier once #7142 is fixed and we're more confident that liveslots ignores GC timing variations).

We'll pretend that there's a per-vat metadata key that says which version of xsnap to use, and that it's currently set to "0" (i.e. lack of a key means "0"). Later, when we want to introduce a not-so-compatible version of XS/xsnap, we'll start using this key. At that point, we'll either need separately named xsnap packages (@agoric/xsnap-v1 vs @agoric/xsnap-v2), or we'll use the package.json dependencies aliasing feature to let the kernel code import multiple package names simultaneously, even though the package.json maps those onto different versions of the same @agoric/xsnap package name.
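That aliasing feature looks roughly like this in a package.json (the version ranges here are hypothetical):

```json
{
  "dependencies": {
    "@agoric/xsnap-v1": "npm:@agoric/xsnap@^1.0.0",
    "@agoric/xsnap-v2": "npm:@agoric/xsnap@^2.0.0"
  }
}
```

The kernel would then import '@agoric/xsnap-v1' or '@agoric/xsnap-v2' based on the per-vat metadata key.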

That is the point at which we'll figure out the semver attributes of these different xsnap packages, what "minor version has changed" means w.r.t. compatibility, etc. We can avoid committing ourselves to some of those details until we need the second version/package.

At that time, we'll update the kernel to record a "current xsnap version identifier" into the DB, and use it when launching the worker. My original plan was to change managerType into a worker field, giving it values of local and xsnap-v1; in that approach, this is the point where we'd introduce xsnap-v2 (which would encompass all of XS, xsnap, lockdown, and supervisor/liveslots).

In our new thinking, xsnap-v2 could refer to just the xsnap binary and the embedded XS library, letting you control both lockdown and supervisor as independent knobs.

Currently, packages/xsnap/ hosts two git submodules, one named moddable (pointing at https://github.com/agoric-labs/moddable.git, our fork of the XS SDK), and a second named xsnap-native (pointing at https://github.com/agoric-labs/xsnap-pub, our fork of Moddable's xsnap repo). We're thinking it would be a good idea to make xsnap-pub host the moddable git-submodule, rather than packages/xsnap/ (changing the git-submodule graph from (A->B, A->C) to (A->B->C)). packages/xsnap/ knows how to build the xsnap binary, knows the path to that binary, and has the API code which spawns a child process from that binary, plus the API code which speaks the netstring pipe to it.

Then, the git commit ID of the xsnap-pub submodule can serve as a name for the xsnap+XS behavior, and the commit ID of the lower-layer moddable submodule is a narrower name for the XS behavior; either might wind up in the per-vat metadata some day.

warner added a commit that referenced this issue Mar 20, 2023
This extracts the XS supervisor bundle out of packages/SwingSet and
into a new package whose NPM name is
`@agoric/swingset-xsnap-supervisor`, and lives (for now) in
packages/swingset-xsnap-supervisor .

The bundle is created explicitly, by running `yarn build`. As with
`xsnap-lockdown`, the intention is that each published version has a
stable bundle, with specific (hashable) contents. Importing clients
can rely upon this stability to e.g. have deterministic xsnap heap
snapshot contents even if the versions of other parts of the system
have changed.

The new README.md describes the API, the stability goals, and how we
achieve them.

Fairly soon, we'll need to manage the version number of this package more
explicitly, either by moving it to a different git repository, or
changing our use of `lerna publish` to avoid spurious changes. But for
now we can leave it in place.

refs #6596
@mhofman

mhofman commented Mar 20, 2023

  • when the kernel is initialized, we'll sample the lockdown and supervisor bundles, store them in the bundleStore under their hash IDs, and add a pair of kvStore keys to remind the kernel of what the "current lockdown/supervisor bundle IDs" are

I believe @michaelfig suggested we could do this lazily and just load the bundle and hash from disk when needed by a new vat.

a top-level yarn build will build each package in alphabetical order

Is that true for all packages or only the ones with circular dependencies? I thought yarn correctly handled the order for packages that have no circular deps.

@warner

warner commented Mar 21, 2023

  • when the kernel is initialized, we'll sample the lockdown and supervisor bundles, store them in the bundleStore under their hash IDs, and add a pair of kvStore keys to remind the kernel of what the "current lockdown/supervisor bundle IDs" are

I believe @michaelfig suggested we could do this lazily and just load the bundle and hash from disk when needed by a new vat.

Yeah, as I wrote up my notes, I got worried about when exactly the DB writes would occur, and how we explain this to host-application authors. Let's see, I can think of three approaches:

  • A: sample during initializeSwingset(), and also during some sort of upgradeSwingset(swingStore) function that we can figure out and implement later (some time before we start using the new bundles)
  • B: sample every time we build the controller
  • C: sample every time we create a new vat

I need to be able to write documentation aimed at host-app authors to explain when state changes might happen, and what their hostStorage.commit() responsibilities are. We need well-defined windows of time during which the kernel might make changes to the swingstore, after which the host app must do their commit.

"A" is the one that most closely matches the current pattern. It would always sample and stash the bundles during initializeSwingset(), at the same time it populates a bunch of other swingstore data (config.vats, bootstrap message on otherwise-empty run-queue, etc), and the host app already knows it needs to commit after this initialization phase, no surprises there. And requiring an explicit upgrade call would introduce another such phase, but only when the host app wants to do an upgrade. The host app would look more like:

const { hostStorage, kernelStorage } = openSwingStore(dbPath);
if (timeToUpgrade) {
  upgradeSwingset(kernelStorage); // explicit upgrade phase, as in "A"
  hostStorage.commit(); // host commits after the upgrade window
}
const c = buildVatController(kernelStorage);
// .. go run blocks

DB writes are a deterministic function of the package upgrade schedule and the host app calling the "upgrade" function.

"B" might be convenient, but the host app would need to do a commit at every launch, just in case this was the time that a new bundle was found. DB writes would be a function of package upgrade schedule and the controller incarnation schedule (process start to process termination). buildVatController() does not currently make any state changes, and I'm not sure I want it to.

The host startup would look something like:

const { hostStorage, kernelStorage } = openSwingStore(dbPath);
const c = buildVatController(kernelStorage);
hostStorage.commit(); // in case bundle upgrades happened
// .. go run blocks

"C" puts the changes in the middle of a controller.run(), so no problems with knowing when to commit. It would mean that the kernel must be given the ability to read the bundle files (encapsulated in some form of the startXSnap endowment, which also needs access to the bundleStore so it can store the sampled bundle, and read them back again later). The DB write will happen kinda haphazardly, during the first vat creation after each controller restart that happens after the on-disk bundle-supplying packages are changed. If a given controller incarnation doesn't create any new vats, the sampling will not happen, even if the package versions have changed. So the DB writes are a deterministic function of both the package upgrade schedule and userspace actions (creating new vats), which is kind of complicated.

It also adds a new failure mode: if someone launches the kernel, then modifies their source tree (e.g. checks out a different version, maybe temporarily), the DB state will depend on a race between their interference and a vat being created. If we used A or B, any potential sampling would be complete by the time kernel startup finished, so the window of vulnerability (to this admittedly "don't do that" case) is a lot smaller.
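For concreteness, a sketch of how "C"-style lazy sampling inside the startXSnap endowment might look; every name here is hypothetical:

```js
import { createHash } from 'node:crypto';

const hashBundle = bundle =>
  `b0-${createHash('sha256').update(JSON.stringify(bundle)).digest('hex')}`;

const makeStartXSnap = ({ kvStore, bundleStore, getDiskBundle }) => {
  const ensureCurrent = which => {
    const bundle = getDiskBundle(which); // e.g. from the xsnap-lockdown package
    const id = hashBundle(bundle);
    if (kvStore.get(`${which}BundleID`) !== id) {
      bundleStore.addBundle(id, bundle);
      kvStore.set(`${which}BundleID`, id); // the haphazard DB write happens here
    }
    return id;
  };
  return async vatID => {
    const lockdownID = ensureCurrent('lockdown');
    const supervisorID = ensureCurrent('supervisor');
    // ...spawn the xsnap process, evaluate bundleStore.getBundle(lockdownID),
    // then bundleStore.getBundle(supervisorID), then deliver to vatID...
  };
};
```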

My Python t-shirt says "explicit is better than implicit":

% python3
Python 3.9.6 (default, Oct 18 2022, 12:41:40)
[Clang 14.0.0 (clang-1400.0.29.202)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import this
The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!
>>>

But the annoying thing about "A" (explicit) is that we have to add a cosmic-swingset ActionType.UPGRADE_SWINGSET "action" to trigger the upgradeSwingset() call, invoked from the Golang-side upgrade handler, sent to the JS side, and awaited, before the remainder of the cosmos upgrade can continue. With "B" or "C", all you have to do is change the JS packages on disk. And we removed the bundle-the-kernel-and-store-it step last year because we liked the simplification of "just git checkout a different version and restart".

Hrm. I know I don't want to do "B". I prefer to be explicit. I think I'm ok with e.g. swing-store either spontaneously upgrading itself (when newer code is launched against an older DB), or refusing to run and requiring a separate program to be run to do the upgrade, which suggests that maybe I'm ok with the kernel spontaneously upgrading the bundles without an explicit upgrade() call, which means that maybe explicit is not so important to me after all. I'm uncomfortable with deferring the bundle reads until semi-random times long after kernel startup. We're not saving any significant space or time by loading the bundles lazily. But I don't want to implement the extra cosmic-swingset actions or coordination code to trigger the swingset upgrade, nor do I really want to add "call upgrade()" to the instructions for host-app authors (it feels like a burden).

Ok, I'll go with "C". At least with these reasons written down, if I change my mind later, I should be able to think more clearly about it.

@michaelfig

A corollary to "C": do the sampling once on controller initialization, but don't write the results to the DB until the first time they're needed.

@mhofman
Copy link
Member

mhofman commented Mar 21, 2023

I don't fully follow the timing issues: the packages providing the bundles would be loaded when loading swingset, but their methods to load the bundle and check the bundle hash would be invoked every time a vat is created. The authority is already encapsulated; no need to provide new raw authority to read the filesystem.

If someone modifies their source tree in a way that causes the existing methods to return new values, they will just fall off consensus. I.e., don't change your source tree while executing; any lazy loading will have similar constraints.

Edit: to summarize, "C", but I don't share the concerns.

@warner
Copy link
Member Author

warner commented Mar 21, 2023

Now that we have a new approach, I'm moving this ticket out of the milestone in favor of the (smaller) #7208. We might choose to use this approach in the future, but we no longer think we need a distinct worker-v1 package for the Vaults release.

@kriskowal

I move to close this issue since it is no longer on the plan of record. We can revive the idea if it eventually becomes necessary to have more than one worker version.
