kernel API for upgrading vats #1848
How would this compare to upgrade purely at the Zoe level? |
Hm.. what does a Zoe-level upgrade mean? I can think of a few options. One is that we don't attempt to modify existing contracts, we just make sure that any new ones are created with the new version. That approach wouldn't even require any code changes: you just install V2, tell everyone about it, and then hope they decide to use V2 instead of V1 going forward. You might add a way to make V1 closed for business (denying the ability to instantiate it ever again). Overall it'd be pretty simple, maybe useful in some cases, but probably not very general-purpose (it kinda qualifies as "upgrade" but not really). Another option would be for contracts to be shipped with upgrade functionality already present: some code that could, in response to some carefully-managed message, evaluate a new source bundle and hand all its state to the result. Zoe or ZCF might hold on to that upgrade facet and only exercise it in response to some governance-type voting mechanism (or only enable it if some assertion trips, or engage a time-lock veto period, or any number of nifty safety mechanisms we might think up). That would probably be the best in terms of making it clear up front what circumstances might trigger an upgrade, and what would happen to the state as the upgrade happens. The downside is that you have to figure out all of that ahead of time, and whatever you miss might not be upgradeable (or at least not without some deeper mechanism: I expect we'll have multiple layers, and we'll use the shallowest tool that does the job). The state transfer is the part I'd be most worried about: we don't know what the V2 behavior will be ahead of time (if we did, we'd publish that instead of V1), so I'm not sure we could write a correct state exporter ahead of time. But it might be possible. 
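As a rough illustration of the second option (a contract shipped with its own upgrade facet), here is a minimal sketch in plain JavaScript. Nothing in it is a real Zoe or ZCF API: `makeUpgradableContract`, `upgradeFacet`, the `migrateState` callback, and the `isApproved` governance check are all hypothetical names invented for this example.

```javascript
// Hypothetical sketch: a contract that ships with its own upgrade facet.
// None of these names are real Zoe/ZCF APIs.
function makeUpgradableContract(makeBehavior, initialState) {
  let state = initialState;
  let behavior = makeBehavior(state);

  // Facet handed to ordinary clients.
  const publicFacet = {
    invoke: (method, ...args) => behavior[method](...args),
  };

  // Facet held by Zoe/ZCF or a governance mechanism, never by clients.
  const upgradeFacet = {
    upgrade(makeNewBehavior, migrateState, isApproved) {
      if (!isApproved()) {
        throw new Error('upgrade not approved');
      }
      // Hand the old state to the new code.
      state = migrateState(state);
      behavior = makeNewBehavior(state);
    },
  };
  return { publicFacet, upgradeFacet };
}

// V1 counts by one; V2 counts by a configurable step.
const makeV1 = state => ({ bump: () => (state.count += 1) });
const makeV2 = state => ({ bump: () => (state.count += state.step) });

const { publicFacet, upgradeFacet } = makeUpgradableContract(makeV1, { count: 0 });
publicFacet.invoke('bump'); // count becomes 1
upgradeFacet.upgrade(makeV2, old => ({ ...old, step: 10 }), () => true);
const afterUpgrade = publicFacet.invoke('bump'); // count becomes 1 + 10
```

The hard part noted above is still visible here: `migrateState` has to be written when V2 is known, so the V1 author can only commit to the shape of the handoff, not to its contents.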
I should note that this proposed API could be used either for dynamic vats (in which case Zoe would hold the upgrade facet for the dynamic vat that holds ZCF and a contract), or for static vats (in which case some deeper governance mechanism would hold the upgrade facet for the static vat that holds Zoe itself). Upgrading Zoe is, of course, a much more serious undertaking than upgrading a single contract. Upgrading a single contract vat cannot violate offer safety (well, I guess it depends upon how much power we invest in the ZCF code), but changing Zoe's behavior could cause all sorts of damage. So the mechanism around a Zoe upgrade would need to be that much more involved and cautious.
This API is mostly about unplanned upgrade, where we weren't prepared enough to add something into the contract or into the vat to perform an orderly handoff of state from old code to new code, and we find ourselves in a situation where the only option is to re-run the vat from scratch but with different code. We should probably have both.
Let's call the first version a "manual upgrade" or a "user-has-to-move upgrade". I would agree that that's the default and is currently possible. I was talking about something different, that I think your "other option" doesn't capture. We already have the mechanism for one contract to create an instance of another contract. And, we already have a task for allowing a Zoe contract to transparently use ZCF to transfer offers to another contract. So no need for any special state exporter for upgrade at the Zoe contract level. So to sum up the "Zoe-level upgrade" that I'm describing: Contract A can be given an installation and start a new contract B and move the offers that were in contract A to contract B. If Contract A and Contract B are meant to maintain the same identity, we get upgrade for free out of features that we already know we need. Furthermore, this all must be in the contract code, so it's much better than vat upgrade in that it is transparent. Upgrade of this kind can only happen if the code allowed it, so the user can read the code and see where upgrade might or might not happen and decide whether to join on that basis. That's good to know that this ticket is about the unplanned upgrades. I think it makes sense for upgrading Zoe itself, but I think it doesn't make sense to use vat upgrade for upgrades of contracts if we have a mechanism at a higher level. |
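The "Zoe-level upgrade" summarized above could be sketched roughly as follows. This is an assumption-laden toy: `makeFakeZoe`, `startInstance`, `acceptSeat`, and the `allowUpgrade` flag are stand-ins invented for this example, not the real Zoe service or the planned offer-transfer feature.

```javascript
// Hypothetical sketch of a contract-level handoff. `makeFakeZoe` is a
// stand-in written for this example, not the real Zoe service.
function makeFakeZoe() {
  return {
    startInstance(installation) {
      const seats = [];
      return {
        name: installation.name,
        seats,
        acceptSeat: seat => seats.push(seat),
      };
    },
  };
}

function makeContractA(zoe, { allowUpgrade }) {
  let seats = [];
  let successor;
  return {
    makeOffer(seat) {
      if (successor) {
        throw new Error('closed: offers now go to the successor');
      }
      seats.push(seat);
    },
    // The upgrade path is visible in the contract's own code, so users can
    // read it before deciding whether to participate.
    upgradeTo(installation) {
      if (!allowUpgrade) {
        throw new Error('this contract committed to never upgrading');
      }
      successor = zoe.startInstance(installation);
      for (const seat of seats) {
        successor.acceptSeat(seat); // move each live offer to contract B
      }
      seats = [];
      return successor;
    },
    getSeatCount: () => seats.length,
  };
}

const zoe = makeFakeZoe();
const contractA = makeContractA(zoe, { allowUpgrade: true });
contractA.makeOffer({ id: 'alice', want: 'moola' });
contractA.makeOffer({ id: 'bob', want: 'simoleans' });
const contractB = contractA.upgradeTo({ name: 'contractB' });
```

A contract created with `allowUpgrade: false` can never take this path, which is the kind of commitment-by-omission the thread is concerned with.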
Because the contracts running under Zoe are how our users express credible commitments to each other, and because Zoe provides the installation and the source code as validation that the commitments you're interacting with are according to that code, I think that contract upgrade is its own conversation. The code should express somehow what the possibilities for its future upgrade are, and what will be the process by which that is decided. ZCF might well expose an API to make some choices straightforward to express, even those that entail magic brain surgery on the state of the contract. The text of the contract, by omission, makes strong credible commitments about what kinds of upgrades are not possible, and what kinds of decisions about upgrades are not possible. |
I posted the above before I read @katelynsills '. I think we're agreed on the fundamental principles --- a contract is only upgradable to the extent, and in the manner, that the code of the contract visibly states. @katelynsills also makes a distinct point that we should prefer less magical mechanisms over more magical mechanisms when these are adequate. Despite my brain surgery comment, I agree with this preference. But this is one where I would not be surprised to learn that more magical interventions are indeed sometimes necessary. We need some realistic experience before concluding that we need more magic. In a separate conversation, @dtribble points out a crucial case that the contract cannot solve for itself without magical help from ZCF and Zoe. If a contract panics --- if it hits an internal error such that it knows that its state is corrupted and it cannot continue, or if something outside the contract with the right to terminate the contract (ZCF, Zoe, SwingSet kernel) makes this determination about the contract or its vat, the default behavior is zcf/contract vat death followed by Zoe doing a payout/exit of all the seats associated with that contract. Our system currently supports only this default. @dtribble points out that this default would be catastrophic for some contracts. @katelynsills made an interesting suggestion that provides a least-magical way to handle this case specifically: A contract might say only how to "upgrade" it if it panics. If it says only this, then it is immutable until and unless it panics. In whatever way the contract expresses what happens if it does panic, it again would still need to commit to how these decisions get made, in order to still have a credible commitment to how these decisions will not be made. |
These instructions to Zoe about how it should handle the sudden death of the contract vat can be thought of as an "advance directive". The statement about what should happen to the live offers, or the assets to the extent that the contract has leeway to reallocate those, can be thought of as a "will", with the receivers being the contract's heirs. The leeway issue is interesting. For any manual upgrade that the contract does for itself, to Zoe, that's just the contract code doing more stuff it is allowed to do. Nothing distinguishes it as an upgrade. Thus, such manual upgrades necessarily cannot violate offer safety or payout liveness, since Zoe enforces that even on adversarial contracts. In the case of the advance directive, we can still enforce both offer safety and payout liveness. We should still enforce offer safety. We probably should still enforce payout liveness, but this is less clear. Such payout liveness imposes a deadline on any emergency repairs. The following is probably a bad idea, but I can at least imagine that we might want to have exit conditions with two deadlines, where the longer one applies only during states of emergency. The fact that a state of emergency is only caused by a panic, and the (presumably vetted) contract code only panics on an undetected bug, provides some degree of safety against abusive declarations of states of emergency. |
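To make the "will" idea a little more concrete, here is a minimal sketch of enforcing offer safety on an advance directive. It assumes a heavily simplified model: amounts are plain numbers keyed by keyword rather than real amounts over brands, conservation of assets is not checked, and `executeDirective` and the `will` callback are hypothetical names.

```javascript
// Hypothetical sketch: checking an "advance directive" against offer safety.
// Amounts are plain numbers keyed by keyword, a simplification of real amounts.
const covers = (allocation, required) =>
  Object.entries(required).every(([kw, amt]) => (allocation[kw] ?? 0) >= amt);

// Offer safety: a seat gets at least what it wanted, or at least a full refund.
const isOfferSafe = ({ give, want }, allocation) =>
  covers(allocation, want) || covers(allocation, give);

// On contract panic, apply the will only if every seat stays offer-safe;
// otherwise fall back to the default of refunding everyone.
function executeDirective(seats, will) {
  const proposed = will(seats);
  const safe = seats.every((seat, i) => isOfferSafe(seat.proposal, proposed[i]));
  return safe ? proposed : seats.map(seat => ({ ...seat.proposal.give }));
}

const seats = [
  { proposal: { give: { Moola: 10 }, want: { Bucks: 3 } } },
  { proposal: { give: { Bucks: 3 }, want: { Moola: 10 } } },
];
const swapWill = () => [{ Bucks: 3 }, { Moola: 10 }];
const greedyWill = () => [{ Bucks: 1 }, { Moola: 12 }];
const honored = executeDirective(seats, swapWill); // will is applied
const rejected = executeDirective(seats, greedyWill); // falls back to refunds
```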
Dispute resolution is much like upgrade, and we may in fact treat it like upgrade. (Attn @kleros) In a split contract (like AMiX) the players can throw the contract into dispute. If they do, then a pre-agreed dispute resolution procedure is engaged, involving either pre-agreed arbiters or a pre-agreed means of selecting arbiters. And a pre-agreed means of composing the judgements of the arbiters into a decision. This can all be done manually without any system support. Enforcing offer safety on the dispute resolution outcome is awesome and unprecedented. This tremendously lowers everyone's risks from corrupt arbiters. However, payout liveness raises a similar dilemma as #1848 (comment) . Enforcing payout liveness imposes severe deadlines on the dispute resolution process. Deadlines that are reasonable for automatic execution may be painful to apply to a process of human judgement. Again, the following is a bad idea, but I can imagine that split contracts somehow accept a pair of deadlines, where the longer one applies only during dispute resolution. Unlike the rest of dispute resolution, this would require some new mechanism be provided by Zoe/ZCF. |
Yeah, upgrades of the contract code that are expressed entirely within the original contract code are the most pleasant of the alternatives (an orderly transfer of obligations). Replacing a vat via some magical unplanned-for process is the second-least pleasant. Changing the behavior at an even lower level would be the first-least pleasant. I suspect we may need all of these mechanisms sooner or later. Let's use "contract upgrade" for the first case, the one @katelynsills is describing. And "vat upgrade" for the kernel-implemented vat-admin-facet-managed mechanism described in my opening comment here. I'm super interested in the notion of The easiest case I can imagine is a false trigger: we have some invariant check that turns out to be more strict than is really necessary, and we don't notice it until some runtime event provokes it into firing. I think we'd want the contract to freeze in place, not triggering refunds or vat death or anything, just hit pause. (Maybe the kernel should arrange to rewind the crank that triggered the assert first, so the vat is in a known state, rather than executing just the first half of the operation). Then we humans investigate and figure out what happened. When we conclude that the assert was buggy and the state is actually just fine, we'd want a way to resume operation. I can imagine some voting mechanism (perhaps with a fairly small / minimally-stringent set of authorized parties) which disables the assert (perhaps for one crank only) and redelivers the message. The next more complex situation I can imagine is one where our investigation reveals an actual problem, and we conclude that the simplest response is to kill the contract instance and have Zoe execute the payouts. 
It might be good if this were the default behavior if we don't implement the pause button (I think an The next more complex case would be us concluding that there is a real problem, but allowing the contract to unwind is not desirable, and we'd rather move the offers to a new contract where the problem is fixed. For this, we might want to abandon the last message (the one that triggered the More complex situations would be where our investigation reveals the problem originating before the
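The first two cases above (a false trigger that is later overridden, and a real problem that ends in payouts) might look roughly like this. It is only a sketch under assumed names: `makePausableVat`, `resumeIgnoringInvariant`, and `terminate` are hypothetical, and "rewinding the crank" is modeled by running the handler on a copy of the state.

```javascript
// Hypothetical sketch: a vat wrapper that pauses on a failed invariant
// instead of dying, and can later redeliver the message.
function makePausableVat(initialState, handler, invariant) {
  let state = initialState;
  let paused = null; // holds the message that triggered the pause

  function deliver(message, { skipInvariant = false } = {}) {
    if (paused) {
      throw new Error('vat is paused');
    }
    // Work on a copy so a failed crank leaves the old state untouched.
    const next = handler(JSON.parse(JSON.stringify(state)), message);
    if (!skipInvariant && !invariant(next)) {
      paused = message; // rewind: `next` is simply discarded
      return 'paused';
    }
    state = next;
    return 'ok';
  }

  return {
    deliver,
    // Governance decided the assert was too strict: redeliver once without it.
    resumeIgnoringInvariant() {
      const message = paused;
      paused = null;
      return deliver(message, { skipInvariant: true });
    },
    // Governance decided the problem is real: give up and pay out.
    terminate: () => ({ payouts: state.balances }),
    getState: () => state,
    isPaused: () => paused !== null,
  };
}

const vat = makePausableVat(
  { balances: { alice: 5 } },
  (s, msg) => {
    s.balances.alice += msg.delta;
    return s;
  },
  s => s.balances.alice <= 10, // an invariant that turns out to be too strict
);
vat.deliver({ delta: 3 }); // balance 8, accepted
const outcome = vat.deliver({ delta: 4 }); // balance would be 12: vat pauses
```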
Here's a sketch of an upgrade API:
@FUDCo and I walked through some more ideas:
Some other ideas: We could give a secondary We might want two separate migration events: one which begins replay, and a second "cutover" event which replaces the old vat worker with the migration worker. These could be spaced out over several weeks, to give validators a chance to let the migration worker catch up. The begin-replay event might not need to be part of consensus, it could be an auxiliary message sent into SwingSet to prepare for a cutover. The cutover instruction is part of consensus, and could include a hash of the expected migration vat state (basically a hash of the secondary DB writes). Operationally, we'd announce a planned upgrade, validator operations would submit the first event, we'd wait a few weeks or something for all kernels to prepare the replacement worker, then the governing body would submit the cutover event.
Once all the important state is in the DB, we could perform a dummy upgrade (no code changes) on a regular schedule, perhaps once a month, which wouldn't change any behavior but would truncate the transcript. If we did this across all vats at the same time, a hopeful new validator (who has no state yet) could catch up efficiently if they launch just after the upgrade finishes. They can copy the DB state from an existing validator (assuming we get that hashed properly), and then they can launch new workers and won't have any transcripts to replay. This would not require us to rely upon deterministic/consensus heap snapshots.
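The two-event migration (a non-consensus begin-replay followed by a consensus cutover carrying an expected hash) could be sketched as below. The names `makeVatSlot`, `beginReplay`, `cutover`, and `hashWrites` are hypothetical, vat code is modeled as a function that writes into a `Map`, and the only real API used is Node's `crypto.createHash`.

```javascript
const { createHash } = require('crypto');

// Hash of the writes a migration worker made to its secondary DB,
// taken over the entries in sorted key order.
function hashWrites(writes) {
  const h = createHash('sha256');
  const entries = [...writes].sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0));
  for (const [key, value] of entries) {
    h.update(`${key}=${value}\n`);
  }
  return h.digest('hex');
}

// Hypothetical sketch of one vat's slot in the kernel.
function makeVatSlot(oldCode) {
  let active = oldCode;
  let migration = null;
  return {
    // Not part of consensus: each validator may prepare at its own pace.
    beginReplay(newCode, transcript) {
      const writes = new Map();
      for (const entry of transcript) {
        newCode(entry, writes);
      }
      migration = { code: newCode, writes };
    },
    // Part of consensus: carries the expected hash of the migrated state.
    cutover(expectedHash) {
      if (!migration) {
        throw new Error('no migration worker prepared');
      }
      if (hashWrites(migration.writes) !== expectedHash) {
        throw new Error('migration state does not match expected hash');
      }
      active = migration.code;
      migration = null;
    },
    getActive: () => active,
  };
}

const v1 = (entry, writes) => writes.set(entry.key, entry.value);
const v2 = (entry, writes) => writes.set(entry.key, entry.value.toUpperCase());
const transcript = [
  { key: 'a', value: 'x' },
  { key: 'b', value: 'y' },
];
const slot = makeVatSlot(v1);
slot.beginReplay(v2, transcript);
slot.cutover(hashWrites(new Map([['b', 'Y'], ['a', 'X']])));
```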
Thoughts on upgrade
This document is to capture the state of my thinking on upgrade. It is not yet a design, but a place to work out the ideas that will ultimately lead to a design. I expect that this will evolve into an actual design document as our understanding gels.
(Note: in a lot of our conversations we have used the term "upgrade". Upon reflection, I think the word "upgrade" implies a value judgement that is irrelevant to the problem at hand. From my perspective, the key issue is how to enable the code to be changed, rather than whether the change itself is an "upgrade". Improvement is often the motivation for changing code, but it's the change itself that introduces the technical challenges. This suggests that "update" might be a better term to use, and for a while I switched to using the term "update" throughout this document. However, in discussion @warner and I concluded that "upgrade" is the term that's been used in most conversations on this topic so far, and it's the word used in the various issues that have been filed on this topic, so for now "upgrade" it is.)
Upgrade strategies
We have identified three flavors of upgrade strategies that each present different tradeoffs with respect to flexibility, difficulty, API complexity, and scheduling of engineering effort. The principal distinction between them is the amount of upfront future-proofing work that must be done inside the deployed vat code. These are not so much upgrade paradigms competing to be The API, but differing approaches that will be appropriate in different circumstances depending on the constraints of the upgrade problem at hand in any particular case.
Designed-in anticipated upgrades - "Builtin"
To the extent that the creators of a body of code (in particular, a contract or other service that runs in/as a vat) anticipate the need for specific changes in the future and thus build in mechanisms to support them directly, we can regard upgrade as a purely user-space problem.
This is probably sufficient for simple things like parameter changes (e.g., tweak an interest rate setting) that can be designed into an application's own API and effectuated without any actual modifications to the code itself. While it is plausible that we could add some library support to make this kind of thing easier, I'm not sure at this point what such support might look like. However, I don't think there are any fundamental problems we need to solve right now for this case. Since any particular use of the Builtin strategy is self-contained within whatever application makes use of it, we will not consider it further here. This is not to downplay its importance -- which I expect to be significant -- but rather reflects that since it is (by definition) entirely inside the application domain it has no specific implementation impact on the design of Swingset.
The time travelling Manchurian Candidate sleeper agent protocol - "Replay"
To the extent that a body of code is written without any forethought at all regarding upgrade, we would instead have to resort to a more brute force approach based on wholesale code replacement. Our current best idea of how to do this is the thing that Brian refers to as "the time-travelling Manchurian Candidate sleeper agent protocol", where we substitute the code that defines a vat prior to t0, then execute a vat replay from the very beginning of time in a "do everything exactly the same as before" mode until reaching a predetermined switchover point, whereupon the new code, now having complete access to any hidden internal state that might have been invisible from the transcript but which got recomputed during the replay, can begin to express new behavior or present a changed API to its clients. In principle, this approach should be sufficient for essentially any imaginable upgrade, but is likely to prove tricky to orchestrate due to the need to never diverge from the recorded transcript during the replay execution.
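A toy version of the Replay strategy, to make the "never diverge until the switchover point" constraint concrete. The transcript format (`{ input, output }` pairs), the `mode` argument, and `replayWithNewCode` are assumptions made for this sketch and do not reflect the actual SwingSet transcript or replay machinery.

```javascript
// Hypothetical sketch of a Replay upgrade. The transcript format and the
// `mode` flag are assumptions made for this example.
function replayWithNewCode(makeVat, transcript) {
  const vat = makeVat();
  transcript.forEach(({ input, output }, index) => {
    const actual = vat.deliver(input, 'replay');
    if (actual !== output) {
      throw new Error(`replay diverged at transcript entry ${index}`);
    }
  });
  return vat; // switchover point reached: new behavior may now be expressed
}

// Old code: a counter that reported its running total.
// New code: identical during replay, but reports double after the switchover.
const makeNewCounter = () => {
  let total = 0; // hidden state, recomputed during replay
  return {
    deliver(n, mode = 'live') {
      total += n;
      return mode === 'replay' ? total : total * 2;
    },
  };
};

const recorded = [
  { input: 1, output: 1 },
  { input: 2, output: 3 },
];
const upgraded = replayWithNewCode(makeNewCounter, recorded);
const firstLiveResult = upgraded.deliver(4); // total is 7, new code reports 14
```

The hidden `total` never appeared in the transcript, yet the new code has it in hand after replay, which is the property that makes this approach a fallback for vats that were not written with upgrade in mind.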
This could be made slightly less tricky by relaxing the deterministic replay rules slightly. In particular, we could:
This approach leaves open the problem of how to get rid of the old vat code once it is no longer required, since that would itself be an additional code upgrade. Obviously you'd like to be able to do this without leading to an infinite regress. One strategy might be to have each new upgrade remove the code that was obsoleted by the previous upgrade, but this feels very unsatisfactory to me for lots of reasons both practical and aesthetic.
Another consideration is that in a production setting, it might be faster, and thus desirable, if we could pause a Swingset while we stop and replay a single vat, leaving the other vat processes undisturbed. If an upgrade needs to regenerate the internal state of a vat (or perhaps a subset of vats), it should not be necessary to also replay all the other vats that are not being upgraded. Note that the current replay mechanism already replays individual vats sequentially, rather than interleaving their execution in the way it was necessarily interleaved when they executed originally. This suggests that implementing single-vat replay might be reasonably straightforward, but as yet the kernel does not have any actual mechanism to do this. (We have previously floated the notion of arranging for vats in separate processes to replay in parallel, as a way to speed up restart. It seems plausible to me that the work needed for selective single-vat replay and the work needed for parallel replay may share some common elements.)
Startup from explicit persistent state only - "Cold Start"
Intermediate between "all has been foreseen" and "no change was ever contemplated" is the kind of approach we expect to be followed for most upgrades.
To the extent that an application can capture in the vatstore all the information needed to reestablish a completely functional working state, a vat process can simply be stopped, have the code that implements it replaced, and then restarted without replay (indeed, we already do restart without replay for transcriptless vats such as the comms vat). Instead of replaying from t0, it would rebuild its working state directly from the persistent store, possibly performing any required data migration or schema upgrades as part of this (whether to execute such data changes in a batch at upgrade time or incrementally as part of the future execution of the vat process is an important practical question, and one that could have significant impact on the upgrade API design, but I don't believe it's a question fundamental to this upgrade strategy per se). Since the Cold Start strategy requires that all information necessary to resume operation be captured in the persistent store, it follows that suitably validated copies of the persistent store could also be used to initialize new validator instances. This seems better than demanding that anyone who wants to spin up a new validator be willing to assume the cost of completely re-executing the entire history of the entire chain from the very beginning. We've been worrying about this problem for some time, completely outside the context of the upgrade problem. While XS process snapshots provide a way to restart a vat without requiring replay, they don't lend themselves to being shared in a trustless way with others. In contrast, the contents of the persistent store can be so shared, since its history of data modifications -- and thus its state at any given time -- is part of the chain's consensus state. This suggests that as part of normal operation the chain should periodically (perhaps monthly) checkpoint the persistent store and place a hash of this into a block. 
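A minimal sketch of the Cold Start idea, assuming a `Map` as a stand-in for the vatstore. The key names (`schemaVersion`, `interestRate`, `interestRateBP`) and the percent-to-basis-points migration are invented for illustration; the sketch performs the data migration in a batch at startup, which is only one of the two options mentioned above.

```javascript
// Hypothetical sketch of a Cold Start upgrade. A Map stands in for the
// vatstore; the key names and schema versions are invented for this example.
function startVatV2(vatstore) {
  const version = Number(vatstore.get('schemaVersion') ?? 1);
  if (version < 2) {
    // Batch migration at upgrade time: V1 stored a rate in percent,
    // V2 stores it in basis points.
    const percent = Number(vatstore.get('interestRate'));
    vatstore.set('interestRateBP', String(percent * 100));
    vatstore.delete('interestRate');
    vatstore.set('schemaVersion', '2');
  }
  // Rebuild working state from the persistent store only; no transcript replay.
  const rateBP = Number(vatstore.get('interestRateBP'));
  return {
    interestOn: principal => (principal * rateBP) / 10000,
  };
}

const vatstore = new Map([['interestRate', '5']]); // left behind by V1
const vat = startVatV2(vatstore);
const interest = vat.interestOn(200); // 200 * 500 / 10000
const restarted = startVatV2(vatstore); // second start: nothing left to migrate
```

Recording `schemaVersion` in the store is also one way for the new code to learn whether the upgrade has already happened.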
New validators could then begin operation only having to replay any activity that had happened since the last of these periodic checkpoints -- or more likely, simply choose to begin operation at the time such a checkpoint is made.
Choosing and combining strategies
A question one might reasonably ask is: given that the Cold Start strategy not only supports upgrade more easily and directly than the Replay strategy, but might also end up being mandatory anyway to enable an open validator ecosystem, why spend time thinking about the Replay strategy at all? The answer is that it provides a recovery pathway in the plausible event of imperfect foresight. If it proves to be the case (presumably by mistake) that a vat actually had some hidden state whose loss might cause consensus breakage, a Replay upgrade might be our only way out of the problem. Note that if this happened it would probably be subtle and weird, since if we're normally stopping and restarting vats from cold storage with some frequency it seems likely that such a problem would manifest quickly (in particular, during testing before the code in question is even released). Indeed, one approach might be to delay investing any significant engineering effort into implementing the Replay strategy at all until such a time as we find ourselves in a situation where it is needed. Especially if we engineer all of our basic services and contracts around Cold Start upgrades, it is not entirely crazy to speculate that the need for Replay upgrades will never actually happen. The worry, though, is that if we do find ourselves in such a state of need, it might very well be in a situation of extreme time pressure to rectify some kind of urgent, catastrophic operational problem. Consequently we need to think very carefully about how best to invest our development resources here.
One thing that does seem clear is that in the event of a Replay upgrade, one of the things that the upgrade should try to accomplish is to leave the vat in a state where further upgrades can be effected using the Cold Start strategy. In particular, this answers the question raised above as to how a Replay upgrade gets rid of the old code that has been superseded: the Replay upgrade is followed by a Cold Start upgrade that does this.
API
There are two aspects of the upgrade API that can largely be considered independently. These can be roughly labelled the internal API and the external API. The internal API is used by code within a vat to actually upgrade itself and its data. It is concerned with how the vat accesses persistent storage, how it learns what mode it is executing in, how it determines what it is supposed to do, and so on. It is principally about the means for actually interrogating and manipulating a vat's memory state and data store to effectuate any needed changes. The external API is used to manage an act of upgrade: when it happens, how it gets initiated, and what actual changes are permitted (e.g., which code bundle gets substituted for whatever the vat was running previously). It is principally about governance and access control. I anticipate that the actual usage of the internal API will vary idiosyncratically from one upgrade to another depending on the nature and complexity of the changes the upgrade entails, whereas the external API will be used in fairly stable arrangements within the implementations of the various hosting frameworks in which vats are run, largely independent of the details of any particular upgrade.
Internal (transmogrification) API
Both the Replay and Cold Start strategies presume that, from the vat's point of view, the code implementing the vat has been replaced prior to execution with the upgraded code. That new code can effect any data changes needed in addition to implementing any new or changed vat behavior.
Consequently, the internal API is not concerned with getting the replacement code installed but rather with what that code is able to do once it's running. Given that constraint, the internal API needs to enable the code to do three things:
In the Cold Start scenario, requirement #1 above can be accomplished by interrogating the persistent state. Upgrade code can record its status in the vatstore for later reference. Absence of such a record can be taken as an indicator that upgrade has yet to happen. In the Replay scenario, things are a bit more subtle, because the code is pretty much by definition in a situation that had not been prepared for it. However, even then inspection of the persistent state is likely sufficient because the code can presume that it is starting from the beginning; indeed, as mentioned above, I expect that a key goal of most Replay upgrade code will be to transform the vat state into one that can henceforth be upgraded via the Cold Start strategy. Consequently, the means to satisfy requirement #1 can be folded into the means to satisfy requirements #2 and #3.
Stored data can take two forms: (a) data explicitly read from or written to the vatstore using the vat power provided for this purpose, indexed by keys managed by the vat code itself, and (b) virtual objects, which in normal operation are managed implicitly by the VOM, stored using keys that are purposely hidden from the vat code. Case (a) does not require special treatment for upgrade, since everything is under the vat code's direct control, so essentially all of the API design challenge concerns case (b).
Persistent collections, once we have them, could introduce further complications, but given that collections are still a work in progress I'm not sure it makes sense to invest a lot of work in them here. A couple of observations: to the extent that they are explicit collections of explicit data, they would fall under case (a) above. To the extent that they contain references to virtual objects, they would indirectly fall under case (b), but not in a way that I think introduces any additional wrinkles into the design.
It is entirely conceivable that there are semantic weirdnesses that I've missed which will make the story messier, but for now I'm going to set this question aside. Given that case (a) is satisfied by the existing data access API, all the further design work here need only concern case (b), namely how to enable some kind of explicit interaction with persistent data that had hitherto been accessed only implicitly.
Each virtual object has an associated (in-memory) kind object that provides implementations for its behavior and its instance initialization logic. The execution of the initialization logic in turn defines the virtual object's shape, i.e., how it is to be serialized and deserialized to and from persistent storage. Each kind is assigned an internal kind ID when it is created. This kind ID is subsequently used as part of the vrefs of its instances, so that when a virtual object is read from disk, the vref can be used to locate the kind definition to deserialize the object and to associate the in-memory representative that is thus created with the virtual object's behavior. A kind is described by an
function makeFooInstance(state) {
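  return {
    init(...args) {
      // illustrative: derive the initial state from the creation arguments
      const [value, anotherValue] = args;
      state.whatever = value;
      state.whateverElse = anotherValue;
      // ...
    },
    self: Far('foo', {
      method1(...method1args) {
        // do stuff...
      },
      method2(...method2args) {
        // do other stuff...
      },
    }),
  };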
}
The I propose adding two new global functions: The
    ...
    upgrade(oldState, ...args) {
      // carry over, transform, or newly introduce state properties
      state.whatever = oldState.whatever;
      state.whateverElse = computeSomething(oldState.whateverElse);
      state.wholeNewWhatever = someEntirelyNewValue; // placeholder for a new value
      // ...
    },
    ...
The If all of the instances of a virtual object are upgraded in a batch, then it will be sufficient for the I believe these two functions will be sufficient to enable the Cold Start upgrade case. However, the Replay case has an additional notable complication: the need to avoid any visible divergence, during replay, from the event stream that was recorded in the vat transcript. I believe this can be accommodated by running Replay upgrades with an additional vat power,
External (control) API
Performing an upgrade on a vat involves replacement of the vat's code, which means that at the very least the vat needs to be shut down and restarted. The logic and workflow for doing this differs between static and dynamic vats, since dynamic vats can (by definition) be managed by other vats whereas static vats can only be managed by the Swingset kernel and its associated controller object (and, of course, indirectly by the host in possession of the controller object). For a static vat, the configuration (which specifies the bundle or source file that the vat code is to be loaded from) is a parameter to <as far as I've gotten written down>
ZCF and Contracts, Baggage, Zygotes
The upgrade design we've sketched out over the last few weeks looks at roughly three categories. The smallest does not involve the kernel at all: simple parameter changes (which scarcely qualify as "upgrade") and in-same-vat evaluation of new code (but I believe @erights and others aren't fans of that, and would rather see all code-replacing upgrades use a larger category). The middle category replaces the entire userspace code bundle and gives it a chance to use a subset of the data (virtual objects/collections) prepared by its predecessor. The largest category involves first using the #1691 sleeper-agent protocol to retroactively prepare, then applying the middle-category-type upgrade. This writeup concentrates on the middle category.
While SwingSet will provide a way for any dynamic vat to be upgraded, the primary Agoric use case is specifically for contract vats. All such vats launch with the ZCF (Zoe Contract Facet) bundle, after which Zoe sends a message to ZCF with the contract bundle to be loaded. When we upgrade a contract, we're not generally upgrading ZCF: we're only upgrading the evaluated contract bundle. As a result, the "reload vat with different bundle" scheme described above won't actually help. Also, we need a design that is #2268 zygote-friendly.
Our plan for this is to have ZCF check the "baggage" (data from the predecessor) to determine the state of the contract bundle: has it been installed/evaluated, both installed and started, or neither. Then, during We'll need a mechanism for Zoe to tell the new ZCF instance about the different contract bundle to use. The kernel upgrade API should allow the caller (Zoe) to provide both the vat bundle (ZCF, unchanged) and the
Zygotes (#2268) allow us to amortize the cost of evaluating code bundles, by freezing a copy of the vat before it has differentiated too far, and using that copy as a template from which clones can be made and further differentiated.
In particular, we can use the template vat's heap snapshot as a starting point, so the clones do not have to repeat the (expensive) code evaluation step. We expect to have a moderate number of contracts, but a much larger number of instances of those contracts, so if the evaluation of the contract code is non-trivial, it's probably a win to start from a post-evaluation heap snapshot. @erights suggested that the ZCF bundle is likely to be small compared to the contract bundles, and making a template/zygote out of the "post-ZCF but pre-contract" state wouldn't be worthwhile. I'm not sure I agree, but we'll need to measure the costs to know for real. To support zygotes, the first time around, contract installs will be delivered with a separate message, as are contract instantiations. This provides multiple stages of differentiation in the vat's life, any of which might be used as a template:
We certainly expect to build a zygote template out of stage 2 (one per contract). Zoe will remember this template as part of the "contract install", and make a clone each time it is asked to instantiate the contract. We might also choose to record a single template from stage 1 (just ZCF, no particular contract yet), to accelerate the
So the contract bundle will appear in an
ZCF must be prepared to be executed in either the initial (empty) vat, or to find itself in a pre-existing vat (with baggage). Likewise, contract code must be equally prepared to be the new version instead of the original. ZCF must extract part of the "baggage" and share it with the contract's
So the first version of the contract vat will go like this:
Then, when Zoe and some governance mechanism decide that this instance should be upgraded to version 2:
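The baggage check described in this comment (neither installed nor started, installed only, or installed and started) could be sketched roughly as follows. This is only a toy model: baggage is represented by a plain `Map` rather than the real durable collection, and the key names `contractBundle` and `contractStarted`, as well as the function `determineStage`, are hypothetical.

```javascript
// Toy model: the real baggage is a durable collection handed to the vat.
// The key names below are hypothetical, chosen only for illustration.
function determineStage(baggage) {
  if (!baggage.has('contractBundle')) {
    return 'empty'; // fresh ZCF, no contract installed yet
  }
  if (!baggage.has('contractStarted')) {
    return 'installed'; // contract evaluated, but not yet started
  }
  return 'started'; // a predecessor installed and started the contract
}

const baggage = new Map();
const stages = [determineStage(baggage)];
baggage.set('contractBundle', 'bundle-v1');
stages.push(determineStage(baggage));
baggage.set('contractStarted', true);
stages.push(determineStage(baggage));
console.log(stages.join(' -> ')); // empty -> installed -> started
```

A new ZCF incarnation would run a check like this first and then decide whether to wait for an install message or to re-evaluate the (possibly replaced) contract bundle.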
Failed Upgrades Should Leave Old Version In Place
As mentioned in the meeting this afternoon, one really desirable property would be for a failed upgrade to leave the vat in its previous configuration. I think we can pull this off by putting all of the upgrade steps into their own crank, which (thanks to the crankBuffer) can be committed or rolled back as a unit. Just as we currently have a
I believe the heap snapshot ID is recorded in the KV store, so as long as the lifetime management of
The one part that might be funny is erasing the transcript, because transcripts live in the streamStore, and are indexed by vatID. The
streamStore commits are effected by tracking the index of the end of the stream in the KV store.
If the end-of-stream index is set (in the crank buffer) to 0, then we write a new transcript entry, then we set end-of-stream to the end of that new entry, then we revert the crank buffer... is the start of the original transcript now clobbered? I think the transactionality of the streamStore relies upon it being append-only, and "erasing" the transcript violates that assumption. If so, then we must either improve streamStore to tolerate this, or make sure we don't write the new entry until after we know we're going to commit for real. Maybe we incorporate a crankBuffer-like thing into streamStore that holds the new entry in RAM until
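One way to picture the "crankBuffer-like thing" is a transcript store that keeps both the erase request and any new entries in RAM, and only touches the durable data on commit. The sketch below is a toy under that assumption. `makeBufferedStream`, `resetTranscript`, `commit`, and `abort` are hypothetical names and not the actual streamStore API.

```javascript
// Toy transcript store: durable entries are only modified at commit time.
// Hypothetical API, for illustration only.
function makeBufferedStream() {
  const committed = []; // stands in for the on-disk, append-only stream
  let pending = null; // { reset, entries }, held in RAM during the crank

  const ensurePending = () => {
    if (!pending) {
      pending = { reset: false, entries: [] };
    }
    return pending;
  };

  return {
    // Request erasure of the old transcript without clobbering it yet.
    resetTranscript() {
      const p = ensurePending();
      p.reset = true;
      p.entries = [];
    },
    append(entry) {
      ensurePending().entries.push(entry);
    },
    commit() {
      if (!pending) return;
      if (pending.reset) committed.length = 0;
      committed.push(...pending.entries);
      pending = null;
    },
    abort() {
      pending = null; // the durable entries were never touched
    },
    read() {
      return [...committed];
    },
  };
}

const stream = makeBufferedStream();
stream.append('v1-delivery-1');
stream.append('v1-delivery-2');
stream.commit();

// A failed upgrade: erase + new entry, then roll back.
stream.resetTranscript();
stream.append('v2-startVat');
stream.abort();
const afterFailure = stream.read();

// A successful upgrade: same steps, then commit.
stream.resetTranscript();
stream.append('v2-startVat');
stream.commit();
const afterSuccess = stream.read();

console.log(afterFailure); // [ 'v1-delivery-1', 'v1-delivery-2' ]
console.log(afterSuccess); // [ 'v2-startVat' ]
```

The point of the design is that the append-only assumption of the durable stream is never violated before the commit decision is known.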
This is a first pass at the API you'd use to tell the kernel to upgrade a dynamic vat. None of this is implemented yet. refs #1848
This allows liveslots to abandon a previously-exported object. The kernel marks the object as orphaned (just as if the exporting vat was terminated), and deletes the exporter's c-list entry. All importing vats continue to have the same access as before, and the refcounts are unchanged. Liveslots will use this during `stopVat()` to revoke all the non-durable objects that it had exported, since these objects won't survive the upgrade. The vat version being stopped may still have a Remotable or a virtual form of the export, so userspace must not be allowed to execute after this syscall is used, otherwise it might try to mention the export again, which would allocate a new mismatched kref, causing confusion and storage leaks. Our naming scheme would normally call this `syscall.dropExports` rather than `syscall.abandonExports`, but I figured this is sufficiently unusual that it deserved a more emphatic name. Vat exports are an obligation, and this syscall allows a vat to shirk that obligation. closes #4951 refs #1848
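As a rough illustration of the bookkeeping described in this commit message, here is a toy kernel model in which abandoning an export orphans the object and removes only the exporter's c-list entry. The data structures and the `makeToyKernel` helper are hypothetical simplifications, not the real kernel implementation.

```javascript
// Toy kernel tables; hypothetical and heavily simplified.
function makeToyKernel() {
  const owners = new Map(); // kref -> exporting vatID, or null if orphaned
  const clists = new Map(); // vatID -> Map(vref -> kref)
  const refCounts = new Map(); // kref -> number of importers

  const clistFor = vatID => {
    if (!clists.has(vatID)) clists.set(vatID, new Map());
    return clists.get(vatID);
  };

  return {
    exportObject(vatID, vref, kref) {
      owners.set(kref, vatID);
      clistFor(vatID).set(vref, kref);
      refCounts.set(kref, 0);
    },
    importObject(vatID, vref, kref) {
      clistFor(vatID).set(vref, kref);
      refCounts.set(kref, refCounts.get(kref) + 1);
    },
    abandonExports(vatID, vrefs) {
      const clist = clistFor(vatID);
      for (const vref of vrefs) {
        const kref = clist.get(vref);
        if (kref === undefined || owners.get(kref) !== vatID) {
          throw new Error(`${vref} is not an export of ${vatID}`);
        }
        owners.set(kref, null); // orphaned, as if the exporter had terminated
        clist.delete(vref); // only the exporter's c-list entry goes away
      }
    },
    inspect: () => ({ owners, clists, refCounts }),
  };
}

const kernel = makeToyKernel();
kernel.exportObject('v1', 'o+10', 'ko50');
kernel.importObject('v2', 'o-3', 'ko50');
kernel.abandonExports('v1', ['o+10']);

const { owners, clists, refCounts } = kernel.inspect();
console.log(owners.get('ko50')); // null
console.log(clists.get('v2').get('o-3')); // ko50
console.log(refCounts.get('ko50')); // 1
```

The importer's view and the refcount are untouched, which matches the "same access as before" property described above.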
This iterates through all previously-defined durable Kinds and asserts that they have been reconnected by the time buildRootObject() completes. It still needs a better error delivery path: we want the upgrade to fail and get rolled back, but currently `startVat` doesn't have a good way to signal the error. refs #1848
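The shape of that check can be sketched with a toy registry. This is an assumption-laden model: `makeKindRegistry`, `reconnectKind`, and the standalone `startVat` helper below are hypothetical stand-ins, not the liveslots or VatData API.

```javascript
// Toy model of "every previously-defined durable Kind must be reconnected".
// All names here are hypothetical.
function makeKindRegistry(previousKindIDs) {
  const behaviors = new Map(); // kindID -> behavior reattached by the new code
  return {
    reconnectKind(kindID, behavior) {
      if (!previousKindIDs.includes(kindID)) {
        throw new Error(`unknown durable Kind ${kindID}`);
      }
      behaviors.set(kindID, behavior);
    },
    assertAllReconnected() {
      const missing = previousKindIDs.filter(id => !behaviors.has(id));
      if (missing.length > 0) {
        throw new Error(`durable Kinds not reconnected: ${missing.join(', ')}`);
      }
    },
  };
}

function startVat(previousKindIDs, buildRootObject) {
  const registry = makeKindRegistry(previousKindIDs);
  const root = buildRootObject(registry);
  registry.assertAllReconnected(); // runs once buildRootObject() has completed
  return root;
}

const previousKinds = ['kind-purse', 'kind-payment'];
const goodV2 = registry => {
  registry.reconnectKind('kind-purse', {});
  registry.reconnectKind('kind-payment', {});
  return 'root';
};
const badV2 = registry => {
  registry.reconnectKind('kind-purse', {});
  return 'root';
};

const goodResult = startVat(previousKinds, goodV2);
let failure = null;
try {
  startVat(previousKinds, badV2);
} catch (e) {
  failure = e.message;
}
console.log(goodResult); // root
console.log(failure); // durable Kinds not reconnected: kind-payment
```

In this toy the failure surfaces as a thrown exception; how the real `startVat` should report it so that the upgrade is rolled back is the open question noted above.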
We create a durable Kind, and reattach behavior to it in v2. The handle must travel through baggage, demonstrating that baggage works. I'm still looking for the right way to use VatData from within swingset tests; other packages should import @agoric/vat-data, but that might be circular from here. refs #1848
I have two plans for deleting/dropping/abandoning everything. I'm working on implementing the first, but I wanted to write up the second and see if we can switch to it by MN-1 because it has some benefits.
Plan 1: stopVat()
In this approach, we rely upon being able to talk to the vat one last time before the upgrade. We send a
Within
We could break the work up into phases at the calls to
The second phase would also delete the virtual collections and all virtual objects, but wouldn't try to decref anything they pointed to. This would fix the DB space leak, but would still leak some number of imports and durables.
The third phase would do all three. Reference cycles within the durable subgraph could still leak durables and imports.
Assuming that people use virtuals/durables appropriately and don't keep a lot of references in RAM, the first phase costs O(N) in the number of exported virtuals, the second adds in O(N) in the total size of virtual collections, and the third adds O(N) in the number of virtual objects and the count of edges (really the size of the set of referenced objects) from VOs to imports and DOs.
Plan 2: all kernel-side
To allow the kernel to do this work without the help of the retiring vat, the kernel needs to know which vrefs and DB keys are durable and which are not. We can either couple the kernel and liveslots together (which feels like a bad idea), or we can change the key formats to embed the information that the kernel needs to know.
For the vat store, I'll propose enhancing the
For the c-list, the proposal is more radical. Currently I'm suggesting that we add
When the VOM creates vrefs for virtual/durable kinds, it uses
Once we have those tools at our disposal, we don't need
If we were to break this approach into similar phases, the first phase would only clear promises and the virtual exports. With a simple kvstore DB and the same assumptions as above, the cost would be O(N) in virtual exports. The resulting state would be safe, but would leak DB space, imports, and otherwise-unreferenced durables.
The second phase would omit the mark+sweep. The cost would be O(N) summed across virtual exports, the number of virtual objects, and the number of items in virtual collections. It would leak imports and otherwise-unreferenced durables, but would delete all the virtual data.
The third phase (complete approach) would add O(N) in the number of referenced durable objects, plus their edges, plus another O(N) across all durable objects (for the sweep). It would not leak anything, not even when there are cycles within the durable subgraph.
If/when we move to something like SQLite for the kernel DB, this might get more efficient. A single query could delete an entire range of keys (it's probably still O(N), but with a much smaller constant factor because the DB is optimized for it). Deleting the c-list reverse pointers can probably be done with a clever subquery. The mark phase could benefit from
Comparison
I think the costs are similar. The complexity is a lot lower if we only try for the first two phases now (i.e. we tolerate leaked imports/durables that result from reference cycles). The third phase costs O(N) in the number of virtual objects for the
The benefits of doing this work on the kernel, rather than
But doing it on the kernel side requires more coordination between liveslots and the kernel. The ephemeral/durable vatstore and vref format removes a lot of the coordination, but performing a mark+sweep requires more. One idea I had for this was to bundle a subset of the liveslots code (just enough to understand the vatstore encoding formats and perform deletion) at the same time that we make the full bundle. For each vat, we're going to be stashing the bundleID of its liveslots (#4376), so it's not hard to also stash a "deletion helper" bundleID. During upgrade, we could
My big interest in doing it on the kernel side depends upon having something like SQLite: that's where things could really be sped up.
Next Steps
I have the
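The mark+sweep in the third phase of Plan 2 is what lets unreachable cycles in the durable subgraph be collected. A minimal sketch, assuming the durable reference graph is available as a `Map` from durable object ID to the IDs it references (the graph encoding, the IDs, and `sweepUnreachableDurables` are hypothetical, not the real vatstore layout):

```javascript
// Mark everything reachable from the roots, then delete the rest.
// `edges` maps a durable object ID to the durable IDs it references.
function sweepUnreachableDurables(edges, roots) {
  const marked = new Set();
  const stack = [...roots];
  while (stack.length > 0) {
    const id = stack.pop();
    if (marked.has(id)) continue;
    marked.add(id);
    for (const child of edges.get(id) || []) {
      stack.push(child);
    }
  }
  const deleted = [];
  for (const id of [...edges.keys()]) {
    if (!marked.has(id)) {
      edges.delete(id); // sweep: O(N) across all durable objects
      deleted.push(id);
    }
  }
  return deleted.sort();
}

const edges = new Map([
  ['d1', ['d2']], // d1 is kept alive by a root (e.g. an export or baggage)
  ['d2', []],
  ['d3', ['d4']], // d3 and d4 only reference each other
  ['d4', ['d3']],
]);
const deleted = sweepUnreachableDurables(edges, ['d1']);
console.log(deleted); // [ 'd3', 'd4' ]
console.log([...edges.keys()]); // [ 'd1', 'd2' ]
```

Refcount-only cleanup would never free `d3` and `d4`, which is the leak that the first two phases tolerate.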
This deletes most non-durable data during upgrade. stopVat() delegates to a new function `releaseOldState()`, which makes an incomplete effort to drop everything. The portions which are complete are:
* find all locally-decided promises and reject them
* find all exported Remotables and virtual objects, and abandon them
* simulate finalizers for all in-RAM Presences and Representatives
* use collectionManager to delete all virtual collections
* perform a bringOutYourDead to clean up resulting dead references
After that, `deleteVirtualObjectsWithoutDecref` walks the vatstore and deletes the data from all virtual objects, without attempting to decref the things they pointed to. This fails to release durables and imports which were referenced by those virtual objects (e.g. cycles that escaped the earlier purge). Code is written, but not yet complete, to decref those objects properly. A later update to this file will activate that (and update the tests to confirm it works).
The new unit test constructs a large object graph and examines it afterwards to make sure everything was deleted appropriately. The test knows about the limitations of `deleteVirtualObjectsWithoutDecref`, as well as bug #5053 which causes some other objects to be retained incorrectly.
refs #1848
This deletes most non-durable data during upgrade. stopVat() delegates to a new function `releaseOldState()`, which makes an incomplete effort to drop everything. The portions which are complete are:
* find all locally-decided promises and reject them
* find all exported Remotables and virtual objects, and abandon them
* simulate finalizers for all in-RAM Presences and Representatives
* use collectionManager to delete all virtual collections
* perform a bringOutYourDead to clean up resulting dead references
After that, `deleteVirtualObjectsWithoutDecref` walks the vatstore and deletes the data from all virtual objects, without attempting to decref the things they pointed to. This fails to release durables and imports which were referenced by those virtual objects (e.g. cycles that escaped the earlier purge). Code is written, but not yet complete, to decref those objects properly. A later update to this file will activate that (and update the tests to confirm it works).
The new unit test constructs a large object graph and examines it afterwards to make sure everything was deleted appropriately. The test knows about the limitations of `deleteVirtualObjectsWithoutDecref`, as well as bug #5053 which causes some other objects to be retained incorrectly.
The collectionManager was changed to keep an in-RAM set of the vrefs for all collections, both virtual and durable. We need the virtuals to implement `deleteAllVirtualCollections` because there's no efficient way to enumerate them from the vatstore entries, and the code is a lot simpler if I just track all of them. We also need the Set to tolerate duplicate deletion attempts: `deleteAllVirtualCollections` runs first, but just afterwards a `bringOutYourDead` might notice a zero refcount on a virtual collection and attempt to delete it a second time.
We cannot keep this Set in RAM: if we have a very large number of collections, it violates our RAM budget, so we need to change our DB structure to accommodate this need (#5058).
refs #1848
@warner Although the upgrade design for contracts specifically is still a work in progress, from the kernel perspective can we regard this one as done?
There are several remaining tasks that are important for MN-1, which I'll break out into new tickets:
Now that we've got new tickets for the remaining MN-1 work, I'll close this one.
What is the Problem Being Solved?
#1691 describes a way to upgrade a vat by replacing its code and replaying its transcript, such that the new code behaves exactly like the old code up until some future cutover event.
If/when we need to use such a thing, we'll need a way to authorize its use. We want an ocap-appropriate mechanism to change the behavior of existing objects, so that clients of those objects can safely+correctly rely upon that behavior. In particular we want the clients of a contract, who have examined the code of that contract and are relying upon it acting in a certain way, to have a clear model they can follow. By enabling upgrade at all, we change the social contract from "this object will always behave according to code X" to "this object will behave according to code X as amended by upgrade decisions", along with some specific rules about how such upgrade decisions can be exercised.
The ultimate lever to change the behavior of the system is ownership over the underlying platform. In a consensus machine (on-chain), that is expressed by a suitable majority of the validator power deciding to run alternate code. We're looking for a smaller lever that can be expressed in more ocap-ish terms. I'm thinking of a mechanism that allows one vat at a time to be modified, rather than being able to make arbitrary changes to any part of the system.
Description of the Design
Here's a vague sketch, I haven't thought through the details at all.
What if the creator of the vat receives, in addition to the vat's root object and the authority to terminate the vat from the outside, an extra `upgrader` object. This would accept an `upgrade(newCodeBundle, args)` message.
When used, this asks the kernel to build a new vatManager around `newCodeBundle` and replay the entire transcript of the old vat. If this replay fails to match every syscall, `upgrade()` rejects and nothing else happens. If replay succeeds, the new bundle is then issued a special "you have just been upgraded" message (maybe named `cutover()`), which includes the `args` that were given to `upgrade()`. This cutover message gets one crank to finish, after which the old vat is retired and the new vat takes its place.
The cutover message should maybe be sent to a special object, rather than the root object. One option is for `newCodeBundle` to export both `buildRootObject()` and `buildCutoverObject()`, and the cutover message is sent to the latter. The goals here are to 1: make it clear to a reader when precisely the behavior is allowed to change, and 2: improve the readability of the bundle by separating the "steady state" behavior from the pieces needed specifically for the upgrade.
The cutover method might need to invoke some new syscalls to move objects to different vats, or change their IDs (to hierarchical ones) to move their data into secondary storage.
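To make the replay-then-cutover flow concrete, here is a small synchronous sketch. It is only a toy: a "bundle" is modelled as a factory returning `{ dispatch, cutover }`, the transcript is an in-memory array, and the rejection is a thrown exception rather than a rejected promise. `makeVat`, `makeV1`, `makeV2`, and `makeBad` are hypothetical names.

```javascript
// Toy vat with a transcript and a replay-checked upgrader facet.
function makeVat(makeBundle) {
  let current = makeBundle();
  const transcript = []; // [{ delivery, syscalls }]
  return {
    deliver(delivery) {
      const syscalls = [];
      current.dispatch(delivery, s => syscalls.push(s));
      transcript.push({ delivery, syscalls });
      return syscalls;
    },
    upgrader: {
      upgrade(makeNewBundle, args) {
        const candidate = makeNewBundle();
        for (const { delivery, syscalls } of transcript) {
          const seen = [];
          candidate.dispatch(delivery, s => seen.push(s));
          if (JSON.stringify(seen) !== JSON.stringify(syscalls)) {
            throw new Error('replay diverged from transcript; upgrade rejected');
          }
        }
        candidate.cutover(args); // the only point where behavior may change
        current = candidate; // the old vat is retired
      },
    },
  };
}

const makeV1 = () => {
  let count = 0;
  return {
    dispatch(_delivery, syscall) { count += 1; syscall(`count=${count}`); },
    cutover() {},
  };
};
// v2 behaves exactly like v1 until cutover() changes the step size.
const makeV2 = () => {
  let count = 0;
  let step = 1;
  return {
    dispatch(_delivery, syscall) { count += step; syscall(`count=${count}`); },
    cutover(args) { step = args.step; },
  };
};
const makeBad = () => ({
  dispatch(_delivery, syscall) { syscall('count=0'); },
  cutover() {},
});

const vat = makeVat(makeV1);
vat.deliver('increment');
vat.deliver('increment');
let rejected = false;
try { vat.upgrader.upgrade(makeBad, {}); } catch (e) { rejected = true; }
const beforeUpgrade = vat.deliver('increment'); // still served by v1
vat.upgrader.upgrade(makeV2, { step: 10 });
const afterUpgrade = vat.deliver('increment');
console.log(rejected, beforeUpgrade, afterUpgrade); // true [ 'count=3' ] [ 'count=13' ]
```

The rejected upgrade leaves the v1 code in place, and the accepted one only diverges from v1 after `cutover()` has run.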
Another tool that might help make the new bundle easier to read could be to represent all upgrades as an initial "gather data" phase, followed by a "reload data" phase, followed by a regular "run" phase. The gather-data phase would schematize the state of the vat (which might be spread across WeakMaps or whatever) into some flat copyable capdata plus a table of exported object references. The reload-data phase would get those two tables, but otherwise ignores the old code entirely. It reconstructs new objects to implement a suitable new state, and uses some special syscalls to transfer the identity of the old objects to the new ones. Then it switches to the "run" phase which has no remaining trace of the upgrade code.
This would let the upgrade process be examined independently of the new-behavior code. The reader who is interested in how the new vat behaves only has to read the "run" phase. If they are willing to assume the upgrade went smoothly, they can pretend the vat has only ever used the "run" phase of the new code.
The reader interested in the upgrade process can split it at the schema of the data emitted by the gather-data phase (and consumed by the reload-data phase). If the vat was storing most of its state in secondary storage, these phases might be fairly small.
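A toy version of the gather-data / reload-data / run split might look like the following. It is a sketch under stated assumptions: exports are modelled as an `exportTable` keyed by an export ID, `become` simply rebinds an entry in that table, and the purse objects, IDs, and function names are all hypothetical.

```javascript
// exportID -> the object that currently receives messages for that identity
const exportTable = new Map();

// --- old code: state is hidden in closures ---
const makePurseV1 = balance => ({ getBalance: () => balance });
const oldPurses = new Map([
  ['alice', makePurseV1(5)],
  ['bob', makePurseV1(7)],
]);
const oldExportIDs = new Map([['alice', 'o+1'], ['bob', 'o+2']]);
for (const [name, purse] of oldPurses) {
  exportTable.set(oldExportIDs.get(name), purse);
}

// Phase 1 (gather-data): flatten state into copyable data plus a table
// of exported object references. This schema is what a reviewer reads.
function gatherData() {
  const data = {};
  const exportsByName = {};
  for (const [name, purse] of oldPurses) {
    data[name] = { balance: purse.getBalance() };
    exportsByName[name] = oldExportIDs.get(name);
  }
  return { data, exportsByName };
}

// --- new code: sees only the two tables, never the old objects ---
const makePurseV2 = balance => ({
  getBalance: () => balance,
  deposit: amount => {
    balance += amount;
    return balance;
  },
});

// Phase 2 (reload-data): rebuild objects and transfer identity to them.
function reloadData({ data, exportsByName }, become) {
  for (const [name, { balance }] of Object.entries(data)) {
    become(exportsByName[name], makePurseV2(balance));
  }
}

// Hypothetical stand-in for an identity-transfer power.
const become = (exportID, newObject) => exportTable.set(exportID, newObject);

reloadData(gatherData(), become);

// Phase 3 (run): old identities now reach the new behavior.
const aliceBalance = exportTable.get('o+1').deposit(3);
console.log(aliceBalance); // 8
```

Someone auditing the upgrade would only need to check the two tables returned by `gatherData()` against what `reloadData()` consumes.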
The new syscalls to support identity transfer of individual objects might be expressed at the ocap level as a new `vatPowers.become(oldObject, newObject)` (remembering that `vatPowers` are initially only available to `buildRootObject()`, which can choose to share some of them elsewhere, or not). Or perhaps `become` should only be made available to the upgrade code, to be used during the reload-data phase. Clearly `become` is a high-power authority that must be carefully controlled, because in general having message-sending access to object X should certainly not give you the ability to receive other senders' inbound messages to X as well. The upgrade code is special and more powerful than any of the normal runtime code.
Once vats have an `upgrade` facet, we can build governance/voting mechanisms around its use out of our normal ocap tools. Zoe, when asked to instantiate a contract into a new vat, could also be told what the upgrade policy should be. Zoe then holds the upgrade facet and only exercises it when told to by a suitable vote.
For upgrades that are driven by validator consensus, rather than some other constituency, here's a thought: we number the upgrade proposals, validators sign a Cosmos transaction that includes a message to a Cosmos-SDK governance/voting module, that module watches for the votes to meet a passing threshold, if/when that happens the module sends a special message into SwingSet (following similar pathways as IBC and Mailbox messages travel). That message is routed into some special vat (perhaps Zoe) which can react to it. Maybe someone else sends a message into Zoe first to register the `newCodeBundle` and assign it a number, then a subsequent validator vote to activate that number can be the signal that Zoe uses to exercise the upgrade facet.
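The "register a bundle, assign it a number, then exercise the facet on a passing vote" idea can be sketched as a small holder object. This is a toy: it counts distinct voter IDs against a fixed threshold, whereas real validator voting would be stake-weighted and would arrive via the Cosmos-SDK module. `makeUpgradeGovernor` and its methods are hypothetical names.

```javascript
// Toy holder of an upgrade facet that only exercises it after enough votes.
function makeUpgradeGovernor(upgradeFacet, threshold) {
  const proposals = new Map(); // number -> { newCodeBundle, voters, executed }
  let nextNumber = 1;
  return {
    register(newCodeBundle) {
      const number = nextNumber;
      nextNumber += 1;
      proposals.set(number, { newCodeBundle, voters: new Set(), executed: false });
      return number;
    },
    vote(number, voterID) {
      const proposal = proposals.get(number);
      if (!proposal) throw new Error(`unknown proposal ${number}`);
      proposal.voters.add(voterID); // duplicate votes are ignored by the Set
      if (!proposal.executed && proposal.voters.size >= threshold) {
        proposal.executed = true;
        upgradeFacet.upgrade(proposal.newCodeBundle, { proposal: number });
      }
      return proposal.executed;
    },
  };
}

const upgrades = [];
const facet = { upgrade: (bundle, args) => upgrades.push({ bundle, args }) };
const governor = makeUpgradeGovernor(facet, 2);

const proposalNumber = governor.register('bundle-v2');
const afterFirstVote = governor.vote(proposalNumber, 'validator-a');
const afterDuplicate = governor.vote(proposalNumber, 'validator-a');
const afterSecondVote = governor.vote(proposalNumber, 'validator-b');
governor.vote(proposalNumber, 'validator-c'); // already executed, no repeat

console.log(afterFirstVote, afterDuplicate, afterSecondVote); // false false true
console.log(upgrades.length); // 1
```

Because only the governor holds the facet, the upgrade policy is visible in one place and cannot be bypassed by anyone who merely holds a reference to the vat's root object.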