upgrading the kernel itself: controller.setKernelBundleID() #4375

warner · 2022-01-25T07:09:37Z

What is the Problem Being Solved?

We'll need an inline way to upgrade the kernel itself.

Currently, the kernel source code is bundled once during initializeSwingSet and stored in the kvStore under the kernelBundle key. Each time the kernel is launched, this bundle is given to importBundle to form the "kernel compartment".

I decided to keep this bundle around, rather than re-bundling the kernel source on each application restart, for two reasons: (1) to speed up restart (bundling can take a few seconds), and (2) to reduce surprises when you update your source tree without resetting your chain or other application. During debugging sessions where we're replaying recorded chain state under modified kernels, we've needed to overcome this stickiness with tools like packages/SwingSet/misc-tools/rekernelize.js, which re-bundle and overwrite the kvStore entry. As a result, I was considering removing this feature and having the controller re-bundle the kernel source code each time the application launches.

But, after working on #4372 bundlecaps, I realized that this stickiness is actually a feature, which would play nicely into a mechanism to cleanly upgrade the kernel itself. The idea is that kvStore['kernelBundle'] becomes kvStore['kernelBundleID'], and initializeSwingSet is responsible for bundling and installing the initial version. Later, when the application is told to upgrade the kernel, it needs to:

controller.installBundle(newKernelBundle) and get back newKernelBundleID
controller.shutdown()
controller.setKernelBundleID(newKernelBundleID)
(maybe build a new controller)
controller.start()

setKernelBundleID just checks that the bundleID is valid, and writes it into the kvStore. controller.start() reads the bundleID out of kvStore, loads the bundle itself, then does importBundle() as before.
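A minimal sketch of that sequence, using an in-memory mock rather than the real controller. Everything here is hypothetical: `makeMockController`, the shape of the bundle objects, the fake bundle IDs, and the rule that `setKernelBundleID` refuses to run while the kernel is up are assumptions made for illustration only.

```javascript
// Hypothetical in-memory stand-ins; none of this is the real SwingSet API.
function makeMockController(kvStore, bundleStore) {
  let running = false;
  return {
    // Store a bundle and return an ID for it (a fake ID, for illustration).
    installBundle(bundle) {
      const bundleID = `b-${bundle.source.length}-${bundleStore.size}`;
      bundleStore.set(bundleID, bundle);
      return bundleID;
    },
    shutdown() {
      running = false;
    },
    // Only validates the ID and records it; the kernel itself is not touched.
    setKernelBundleID(bundleID) {
      if (running) throw Error('shut down the kernel before changing its bundle');
      if (!bundleStore.has(bundleID)) throw Error(`unknown bundleID ${bundleID}`);
      kvStore.set('kernelBundleID', bundleID);
    },
    // Reads the ID, loads the bundle, and "imports" it
    // (a stand-in for the importBundle() step).
    start() {
      const bundleID = kvStore.get('kernelBundleID');
      const bundle = bundleStore.get(bundleID);
      running = true;
      return { version: bundle.version };
    },
  };
}

const kvStore = new Map();
const bundleStore = new Map();
const controller = makeMockController(kvStore, bundleStore);

// The initializeSwingSet role in this proposal: install the initial kernel.
const v1ID = controller.installBundle({ version: 1, source: 'kernel v1' });
controller.setKernelBundleID(v1ID);
const kernelV1 = controller.start();

// The upgrade sequence from the list above.
const v2ID = controller.installBundle({ version: 2, source: 'kernel v2!' });
controller.shutdown();
controller.setKernelBundleID(v2ID);
const kernelV2 = controller.start();
```

The point of the sketch is that only the kvStore entry changes during the upgrade; the next `start()` picks up whichever bundle that entry names.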

Of course, it is critical that the new kernel can handle the persistent state in which it wakes up. It must look for kvStore flags that indicate whether particular features have been initialized or not. But the kernel is not obligated to mimic the behavior of some earlier version. The host application is responsible for triggering the upgrade at a consensus-managed moment, between blocks, so the new kernel version only has to be consistent with itself.
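One way to picture the "look for kvStore flags" idea is the sketch below. The flag naming scheme (`initialized.<feature>`), the feature names, and `startKernel` itself are invented for this example and are not taken from the kernel source.

```javascript
// Hypothetical kernel start-up: each feature records an "initialized" flag
// in the kvStore, so a newer kernel can tell what an older one already set up.
function startKernel(kvStore) {
  const newlyInitialized = [];
  const features = {
    // feature name -> initializer (names and keys are invented for this sketch)
    runQueue: () => kvStore.set('runQueue', '[]'),
    vatTable: () => kvStore.set('vat.names', '[]'),
  };
  for (const [name, init] of Object.entries(features)) {
    const flag = `initialized.${name}`;
    if (!kvStore.has(flag)) {
      init();
      kvStore.set(flag, 'true');
      newlyInitialized.push(name);
    }
  }
  return newlyInitialized;
}

// State left behind by an older kernel that only knew about the run-queue.
const oldState = new Map([
  ['initialized.runQueue', 'true'],
  ['runQueue', '["msg1"]'],
]);
const firstBoot = startKernel(oldState); // initializes only the missing feature
const secondBoot = startKernel(oldState); // nothing left to initialize
```

Existing data (the old run-queue contents) is left alone, and repeated launches are idempotent.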

A separate issue is how e.g. cosmic-swingset should decide when an upgrade is appropriate. One option is to require an application upgrade, and have the new version pay attention to the block height. When the height reaches a pre-decided point, cosmic-swingset can shut down the kernel, call bundleSource() on the usual path packages/SwingSet/src/kernel/kernel.js, install the resulting bundle, then instruct the controller to use the new bundleID. This approach requires all validators to install the new application before the appointed cutover time, which is also what they would do to replace the Go code in cosmic-swingset, or other low-level non-JS code.
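The block-height approach could look roughly like the following sketch. `makeUpgradeScheduler`, the `kernelUpgradeDone` key, and the step names are hypothetical; `performUpgrade` stands in for the shutdown / bundleSource / install / set-bundleID / restart steps described above.

```javascript
// Hypothetical host-side check, run at the start of every block.
function makeUpgradeScheduler(kvStore, upgradeHeight, performUpgrade) {
  return function beginBlock(height) {
    // The "done" marker lives in persistent state, so a reboot after the
    // cutover does not repeat the upgrade.
    const done = kvStore.get('kernelUpgradeDone') === 'true';
    if (!done && height >= upgradeHeight) {
      performUpgrade();
      kvStore.set('kernelUpgradeDone', 'true');
      return 'upgraded';
    }
    return 'unchanged';
  };
}

const chainState = new Map();
const steps = [];
const performUpgrade = () => {
  steps.push('shutdown', 'bundleSource', 'installBundle', 'setKernelBundleID', 'start');
};
const beginBlock = makeUpgradeScheduler(chainState, 1000, performUpgrade);
const results = [998, 999, 1000, 1001].map(height => beginBlock(height));
```

Because every validator evaluates the same height check against the same state, they all switch kernels at the same block boundary.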

An alternate approach would be to use an in-band transaction to trigger the upgrade. Some external client could use signed txns to perform the controller.installBundle() ahead of time, just as they would install contract code. Then maybe a governance vote triggers the execution of some SwingSet-module code that performs the shutdown/setKernelBundleID/start. This would not require validators to install any new software. However, the bar for the governing committee's approval should be equivalent to getting all validators to replace their software, because the new kernel code gets nearly complete control over the chain. But the execution of the vote might be easier if it can be handled entirely within the governance module.

Description of the Design

Security Considerations

Replacing the kernel code is the most security-critical thing we can imagine, so both the implementation and the code that triggers it must be audited carefully.

Test Plan

unit tests

warner · 2022-02-10T03:08:26Z

Some folks in today's kernel meeting (@michaelfig ? @FUDCo ?) expressed concern about upgrading the kernel without actually restarting the process. I can think of three approaches.

In the first one, cosmic-swingset remains running, but it discards and replaces the controller object. The sequence is like:

some DeliverTxs are executed, delivering swingset messages as usual
then a special DeliverTx tells the controller to change the kernel bundle ID
- this updates a DB entry but doesn't change anything else
the rest of the DeliverTxs arrive for this block
EndBlock runs and the (old) kernel does its usual amount of work
cosmic-swingset commits the block results as usual
now cosmic-swingset remembers that a kernel upgrade is pending: it does controller.shutdown(), then builds a new controller as if it was rebooting the node
the next block begins, and the remaining cranks run under the new kernel

In the second case, we do the same, but the entire cosmic-swingset process exits after committing the block results. systemd or whatever supervisory devops-ish parent they're using notices the process has died, and starts a new one. The new one begins using the new kernel when it calls makeSwingsetController() as usual.

In the third case, both the cosmic-swingset process and the controller remain running. The controller, however, knows how to shut down the old kernel and start up a new one. It can do this within a single DeliverTx operation. The sequence would be:

some DeliverTxs are executed, delivering swingset messages as usual
then a special DeliverTx tells the controller to upgrade the kernel
- the controller updates the kernel bundle ID record in the DB and commits the crank buffer
- the controller shuts down the old kernel
- the controller creates a new kernel, from the new bundleID
- the controller waits for kernel.start() as if the node were rebooting
the rest of the DeliverTxs arrive for this block, pushing messages onto the run-queue
EndBlock runs and the (new) kernel does its usual amount of work
cosmic-swingset commits the block results as usual

From the chain's point of view, the DeliverTx that specifies the kernel upgrade just takes an unusually long time. Any DB changes made by the kernel during upgrade are included in the block buffer that gets committed after EndBlock, at the same time they would have without the upgrade.

In the first two cases, some arbitrary number of kernel cranks (deliveries) are made after the upgrade event, but using the old kernel. This makes the consistency of the kernel state a function of when the host decides to end the block, whereas normally it doesn't depend quite so much on that decision.

In the third case, every crank executed after the upgrade command will happen with the new kernel, regardless of when the host runPolicy decides to end the block, which seems more predictable.
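The difference can be illustrated with a toy simulation. The event names (`crank`, `upgrade`, `commit`) and the strategy labels are invented for this sketch; it only tracks which kernel would execute each crank.

```javascript
// 'crank'   = one kernel delivery
// 'upgrade' = the special DeliverTx that requests the kernel upgrade
// 'commit'  = cosmic-swingset committing the block
function simulate(events, strategy) {
  let active = 'old';
  let pending = false;
  const ranUnder = [];
  for (const event of events) {
    if (event === 'crank') {
      ranUnder.push(active);
    } else if (event === 'upgrade') {
      if (strategy === 'in-place') {
        active = 'new'; // third case: swap kernels inside the DeliverTx
      } else {
        pending = true; // first and second cases: only record the request
      }
    } else if (event === 'commit' && pending) {
      active = 'new'; // new controller (or new process) after the commit
      pending = false;
    }
  }
  return ranUnder;
}

const events = ['crank', 'upgrade', 'crank', 'crank', 'commit', 'crank'];
const afterCommit = simulate(events, 'after-commit');
// → ['old', 'old', 'old', 'new']
const inPlace = simulate(events, 'in-place');
// → ['old', 'new', 'new', 'new']
```

In the after-commit strategies, the number of `old` entries following the upgrade depends on where the `commit` lands; in the in-place strategy it is always zero.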

warner · 2022-03-16T21:22:26Z

Add upgrade-kernel-bundle API next to initializeSwingset, to be called by host application in the v2 application before calling buildVatController. Do not add methods to controller. Kernel upgrade only happens between kernel invocations. We could still do c.shutdown() followed by upgrade followed by second buildVatController in the same process, but we don't think we want to use that.

Maybe use kernel bundleID so that an explicit hash shows up in the v2 application code.

MN-1 to MN-2 transition may not be the first. All upgrades will require replacing the validator code, which may or may not replace the kernel.

warner · 2022-04-27T23:50:17Z

After today's kernel meeting, @kriskowal and I figured that we might not need to make any code changes for MN-1, and we've sketched out some small code changes needed for the subsequent version.

What we need to add in time for version-2 is something like import { reinitializeSwingset } from '@agoric/swingset-vat'. This function will re-bundle the kernel code (as well as lockdown and the xsnap supervisor) and update the DB with the new bundles. That's all.

Now the timeline of upgrade will be:

validators create their DB with version-1 code that runs the original initializeSwingset (and doesn't have reinitializeSwingset)
validators then launch their nodes with code that runs buildSwingsetController each time they reboot
at some point, version-1 will observe a governance action that schedules upgrade-to-2 at e.g. block height 1000
at some point, validators build version-2 and prepare their supervisors to use it after the current version-1 exits with an error
the last thing version-1 sees is the transition to block 1000: something inside cosmos or cosmic-swingset calls a function that asks "can you handle upgrade-to-2?", the answer is "no", and the process terminates
the supervisor sees version-1 exit, and launches version-2
the first thing version-2 sees is the same upgrade-to-2 query
- in response, cosmic-swingset calls reinitializeSwingset before calling buildSwingsetController
- reinitializeSwingset gets kernel source code from the new version-2 installation, and replaces the bundles in the DB
- buildSwingsetController loads the bundles from the DB as usual, so it gets version-2
validators are now using the version-2 kernel code (and lockdown and supervisor)
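The timeline above can be sketched as follows. These are mocks, not the real `@agoric/swingset-vat` functions: the signatures of `reinitializeSwingset` and `buildSwingsetController`, the DB key names, the `handles` list, and the `launch` helper are all assumptions made for illustration.

```javascript
// Mock: replace the stored bundles with ones built from the current installation.
function reinitializeSwingset(db, installation) {
  db.set('kernelBundle', installation.kernel);
  db.set('lockdownBundle', installation.lockdown);
  db.set('supervisorBundle', installation.supervisor);
}

// Mock: load whatever bundles the DB currently holds.
function buildSwingsetController(db) {
  return {
    kernel: db.get('kernelBundle'),
    lockdown: db.get('lockdownBundle'),
    supervisor: db.get('supervisorBundle'),
  };
}

// Hypothetical host start-up. `pendingUpgrade` is the upgrade name the chain
// asks about at this block, or undefined on an ordinary reboot.
function launch(db, installation, pendingUpgrade) {
  if (pendingUpgrade !== undefined) {
    if (!installation.handles.includes(pendingUpgrade)) {
      // "can you handle upgrade-to-2?" -> "no": the process terminates
      throw Error(`cannot handle ${pendingUpgrade}`);
    }
    reinitializeSwingset(db, installation);
  }
  return buildSwingsetController(db);
}

const version1 = { handles: [], kernel: 'k1', lockdown: 'l1', supervisor: 's1' };
const version2 = { handles: ['upgrade-to-2'], kernel: 'k2', lockdown: 'l2', supervisor: 's2' };

// DB as created by version-1's initializeSwingset.
const db = new Map([
  ['kernelBundle', 'k1'],
  ['lockdownBundle', 'l1'],
  ['supervisorBundle', 's1'],
]);

let version1Halted = false;
try {
  launch(db, version1, 'upgrade-to-2');
} catch (err) {
  version1Halted = true; // the supervisor would now launch version-2
}
const upgraded = launch(db, version2, 'upgrade-to-2');
const rebooted = launch(db, version2, undefined);
```

Note that version-1 needs no upgrade-specific code in this sketch: answering "no" is simply the default.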

We'd like to confirm with @michaelfig that this plan will work, and we'd like to understand how cosmic-swingset currently implements the "can you handle?" check. But as long as the check currently says "no", we think we don't need any upgrade-helping code to go into version-1.

If so, we can defer this ticket indefinitely, and/or close it entirely. If non-chain environments would use a similar "replace the whole process" approach for upgrade, then they wouldn't benefit from an in-place "live" kernel upgrade either.

michaelfig · 2022-04-28T00:07:52Z

We'd like to confirm with @michaelfig that this plan will work

IMHO, it looks like it would work just fine.

and we'd like to understand how cosmic-swingset currently implements the "can you handle?" check.

The cosmos-sdk "can you handle?" check defaults to "no", and requires additional application wiring to change that to "yes" for a given upgrade named "XXX". The chief strength of the Cosmos upgrade system is that it is lazy: designing each upgrade handler is left to our future selves, as needed.

But as long as the check currently says "no", we think we don't need any upgrade-helping code to go into version-1.

That's right. The governance proposal would vote for a software upgrade at block 1000 to version-2, with human- and possibly machine-readable instructions for how to install the SDK that understands version-2. When block 1000 rolls around, the version-1 chain halts ("I don't know version-2"). It doesn't matter how many times you restart version-1, it just keeps halting.

But, if you start version-2 with version-1's chain home directory, that would trigger the golang/cosmos/app code that says "Aha, version-2, I got this", and does any cosmos-level data migrations needed. Then that migration would dispatch a message to cosmic-swingset's version-2 handler (something like { type: 'UPGRADE_SWINGSET', upgradeName: 'version-2' }), which runs the SwingSet version-2 data migration (reinitializeSwingset as you specified). After that, the chain continues startup, continuing past the version-2 upgrade, and begins block 1000.
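A rough sketch of what the cosmic-swingset side of that dispatch might look like. Only the message shape `{ type: 'UPGRADE_SWINGSET', upgradeName: 'version-2' }` comes from the description above (and even that was given as "something like"); `makeSwingsetUpgradeHandler`, the migrations map, and the `BEGIN_BLOCK` message type are hypothetical.

```javascript
// Hypothetical handler: maps an upgrade name to the SwingSet-level migration
// that should run before the controller is built.
function makeSwingsetUpgradeHandler(migrations) {
  return function handleMessage(message) {
    if (message.type !== 'UPGRADE_SWINGSET') {
      return false; // not an upgrade message; let other handlers deal with it
    }
    const migrate = migrations.get(message.upgradeName);
    if (migrate === undefined) {
      throw Error(`no SwingSet handler for upgrade ${message.upgradeName}`);
    }
    migrate();
    return true;
  };
}

const log = [];
const handleMessage = makeSwingsetUpgradeHandler(
  new Map([['version-2', () => log.push('reinitializeSwingset')]]),
);

const handled = handleMessage({ type: 'UPGRADE_SWINGSET', upgradeName: 'version-2' });
const ignored = handleMessage({ type: 'BEGIN_BLOCK' });
```

An unknown upgrade name throws rather than being silently ignored, mirroring the "halt if you don't know this version" behavior described for the cosmos level.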

kriskowal · 2022-04-28T02:41:31Z

I’ve punted this to MN-1.1. Thanks for talking me through this, @warner.

warner · 2022-07-02T22:56:52Z

We decided (and executed, in #5679) to stop persisting the kernel bundle, so we now re-bundle the kernel each time the application launches. This will automatically pick up the current kernel code, removing that portion of the motivation for this API.

We don't yet have a story for discrete upgrades of the kernel DB: basically the new kernel code must be prepared to handle data from any previously released kernel. We can introduce new DB keys to indicate the version of specific tables, if that helps. But the thing we're missing (and may or may not need) is some sort of distinct "you've been upgraded!" trigger that causes a schema conversion.

Such an event would help us know if/when to rebundle the lockdown and supervisor (and liveslots) bundles, which have similar needs as the kernel bundle did, but aren't as obvious a candidate for the same "unpersisting" change we just made for the kernel.

warner · 2023-01-24T18:26:38Z

We've revised our plan for lockdown and supervisor (moving them into a separate "worker-v1" package, #6596), so we no longer need a rebundling trigger. We do still need DB upgrades, but that should be managed by the kernel as it inspects a version flag in the DB itself.

So I'm going to close this in favor of the #6596 plan and having the host app change the version of its dependency upon @agoric/swingset-vat, without any persistent kernel source code in the DB.
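For the remaining DB-upgrade piece, a version flag inspected by the kernel could work along the lines of the sketch below. The key names (`version`, `vatNames`, `vat.names`, `bundles`) and the migration steps are invented; this is one possible shape, not the kernel's actual schema handling.

```javascript
// migrations[i] upgrades the DB from schema version i to version i + 1.
const migrations = [
  // 0 -> 1: rename a key (key names are invented for this sketch)
  kv => {
    kv.set('vat.names', kv.get('vatNames'));
    kv.delete('vatNames');
  },
  // 1 -> 2: add a table that older kernels never created
  kv => {
    kv.set('bundles', '[]');
  },
];

// Hypothetical start-up step: bring the DB up to the version this kernel expects.
function upgradeDB(kv) {
  let version = kv.has('version') ? Number(kv.get('version')) : 0;
  if (version > migrations.length) {
    throw Error('DB was written by a newer kernel than this one');
  }
  while (version < migrations.length) {
    migrations[version](kv);
    version += 1;
    kv.set('version', String(version)); // record progress after each step
  }
  return version;
}

// A DB written by the oldest kernel, which predates the version flag.
const db = new Map([['vatNames', '["bootstrap"]']]);
const reached = upgradeDB(db);
const again = upgradeDB(db); // already current, so no steps run
```

With this shape, the new kernel handles data from any previously released version by replaying only the steps that DB has not yet seen.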

warner added enhancement New feature or request SwingSet package: SwingSet labels Jan 25, 2022

warner mentioned this issue Jan 25, 2022

upgrading liveslots: controller.setLiveslotsBundleID() #4376

Closed

warner added the MN-1 label Jan 25, 2022

Tartuffo removed the MN-1 label Feb 7, 2022

warner mentioned this issue Feb 8, 2022

refactor swingset source code to match runtime artifacts #4502

Closed

warner self-assigned this Feb 9, 2022

warner mentioned this issue Feb 9, 2022

tools to debug+resume after kernel panic #1677

Open

Tartuffo added this to the Mainnet 1 milestone Mar 23, 2022

This was referenced Apr 5, 2022

Sufficient durability / upgradability of Vat contracts for launch #5014

Closed

xsnap launches with different metering limit from snapshot vs from empty #5040

Closed

Tartuffo removed this from the Mainnet 1 milestone May 2, 2022

warner mentioned this issue Jun 28, 2022

unpersist the kernel bundle (was: API to trigger kernel rebundling) #5679

Closed

warner mentioned this issue Jul 9, 2022

record bundleID of lockdown/supervisor/liveslots for each vat #5703

Closed

Tartuffo added migrate-product-backlog and removed migrate-product-backlog labels Nov 17, 2022

warner closed this as completed Jan 24, 2023
