Include extra replay data in vat transcripts #6770

mhofman · 2023-01-10T16:45:59Z

What is the Problem Being Solved?

When replaying vats with the replay-transcript.js tool, some information is useful or necessary to ensure that the replay is faithful to the original:

* The schedule of heap snapshot saves: each save forces a GC, which affects the subsequent state of the worker.
* The computrons used by a delivery: they influence the SwingSet execution (through the run policy) but cannot be matched during replay with the tool.

Description of the Design

This data should be included alongside the other vat transcript data in the stream store.
I believe stream store additions are not currently part of the activityHash generation, but even if they were, both of these are effectively part of consensus operations already, so they could safely be included even then (we decided to move heap snapshot hashes under consensus for state-sync, see #3769).

Observation

While we've also seen reloads from snapshot trigger bugs in XS and affect further execution, these operations should not be included in the transcript data because:

* loading is not an operation performed under consensus
* these are XS bugs which should disappear over time (see XS snapshot content deterministic after reload #5330)
* the slog already records those loads from snapshot
* the replay tool is able to fork its execution on snapshot taking, which allows exploring the space of divergence more fully (feat(swingset-tools): Expand replay tool for anachrophobia diagnosis #6723)

Test Plan

A unit test to verify that the data is included. No integration test exists for the replay tool.

mhofman · 2023-02-08T17:48:56Z

@FUDCo this is the extra info I mentioned we should have recorded in the transcript.

warner · 2023-04-16T23:11:10Z

I'm working on this now, on top of the groundwork in PR #7428

* remove transcript.js, functionality merged into vat-warehouse
* move transcript replay into vat-warehouse
* remove vatSyscallHandler from manager-factory
  * it becomes an argument to manager.deliver()
  * created by vat-warehouse, not vat-loader
* remove compareSyscalls from manager-factory
  * vat-warehouse embeds it in the syscall handler
* remove workerCanBlock: always assumed
* remove useTranscript from manager
  * all deliveries build a transcript entry
  * vat-warehouse only saves it if options.useTranscript is true
* build full anachrophobia log message after delivery is complete
  * show full syscall list for the delivery
  * each annotated as ok/wrong/extra/missing
* shorter transcript property names

refs #6770
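To illustrate the ok/wrong/extra/missing annotation mentioned in the last items, here is a minimal sketch of a positional comparison between the syscalls recorded in a transcript entry and the syscalls made during replay. The function name `annotateSyscalls`, the syscall tuples, and the comparison by JSON serialization are all assumptions made for illustration; this is not the vat-warehouse code.

```javascript
// Hypothetical helper (not the vat-warehouse implementation): compare the
// syscalls recorded in the transcript against those made during replay.
// Assumption: two syscalls match when their JSON serializations are equal.
function annotateSyscalls(expected, actual) {
  const report = [];
  const length = Math.max(expected.length, actual.length);
  for (let i = 0; i < length; i += 1) {
    if (i >= expected.length) {
      // the replay made a syscall that the transcript never recorded
      report.push({ status: 'extra', got: actual[i] });
    } else if (i >= actual.length) {
      // the transcript recorded a syscall that the replay never made
      report.push({ status: 'missing', expected: expected[i] });
    } else if (JSON.stringify(expected[i]) === JSON.stringify(actual[i])) {
      report.push({ status: 'ok', got: actual[i] });
    } else {
      report.push({ status: 'wrong', expected: expected[i], got: actual[i] });
    }
  }
  return report;
}

// Made-up syscalls, for illustration only.
const expected = [
  ['vatstoreGet', 'key1'],
  ['vatstoreSet', 'key1', 'a'],
  ['send', 'o-1'],
];
const actual = [
  ['vatstoreGet', 'key1'],
  ['vatstoreSet', 'key1', 'b'],
];
const report = annotateSyscalls(expected, actual);
console.log(report.map(r => r.status).join(',')); // ok,wrong,missing
```

Building the whole report after the delivery completes, rather than stopping at the first mismatch, is what lets the log show the full syscall list for the delivery.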

This introduces four new pseudo-delivery events to the transcript:

* 'initialize-worker': a new empty worker is created
* 'load-snapshot': a worker is loaded from heap snapshot
* 'save-snapshot': we tell the worker to write a heap snapshot
* 'shutdown-worker': we stop the worker (e.g. during upgrade)

These events are not actually delivered to the worker: they are not VatDeliveryObjects. However many of them are implemented with commands to the worker (just not `deliver()` commands). The vat-warehouse records these events in the transcript to help subsequent (manual/external) replay tools know what happened. Without them, we'd need to deduce e.g. the heap-snapshot writing schedule by counting deliveries and comparing them against snapshotInitial/snapshotInterval.

The 'save-snapshot'/'load-snapshot' pair indicates what a replay would do. It does not mean that the vat-warehouse actually tore down the old worker and immediately replaced it with a new one (from snapshot). It might choose to do that, or the worker itself might choose to replace its XS engine instance with a fresh one, or it might keep using the old engine. The 'save-snapshot' command has side-effects (it does a forced GC), so it is important to keep track of when it happened.

The transcript is broken up into "spans", delimited by heap snapshots or upgrade-related shutdowns. To bring a worker up to date, we want to start a worker (either a blank one, or from a snapshot), and then replay the "current span". With this change, the current span always starts either with 'initialize-worker' or with 'load-snapshot', telling us exactly what needs to be done. The span then contains all the deliveries that must be replayed. The current span will never include a 'save-snapshot' or 'shutdown-worker': the span is closed immediately after those events are added, so replay will never see them. But a tool which replays a historical span will see them at the end.

The types were improved to make `TranscriptDelivery` be a superset of `VatDeliveryObject`. We also record TranscriptDeliveryResult, which is currently a stripped down subset of VatDeliveryResult (just the "ok" status), except that save-snapshot includes the snapshot hash in its results. In the future, we'll probably record the deterministic subset of metering results (computrons, maybe something about memory allocation).

refs #7199 refs #6770

This introduces four new pseudo-delivery events to the transcript:

* 'initialize-worker': a new empty worker is created
* 'load-snapshot': a worker is loaded from heap snapshot
* 'save-snapshot': we tell the worker to write a heap snapshot
* 'shutdown-worker': we stop the worker (e.g. during upgrade)

These events are not actually delivered to the worker: they are not VatDeliveryObjects. However many of them are implemented with commands to the worker (but not `deliver()` commands). The vat-warehouse records these events in the transcript to help subsequent manual/external replay tools know what happened. Without them, we'd need to deduce e.g. the heap-snapshot writing schedule by counting deliveries and comparing them against snapshot initial/interval.

The 'save-snapshot'/'load-snapshot' pair indicates what a replay would do. It does not mean that the vat-warehouse actually tore down the old worker and immediately replaced it with a new one (from snapshot). It might choose to do that, or the worker itself might choose to replace its XS engine instance with a fresh one, or it might keep using the old engine. The 'save-snapshot' command has side-effects (it does a forced GC), so it is important to keep track of when it happened.

As before, the transcript is broken up into "spans", delimited by heap snapshots or upgrade-related shutdowns. To bring a worker up to date, we want to start a worker (either a blank one, or from a snapshot), and then replay the "current span". With this change, the current span always starts either with 'initialize-worker' or with 'load-snapshot', telling us exactly what needs to be done. The span then contains all the deliveries that must be replayed. Old spans will end with `save-snapshot` or `shutdown-worker`, but the current span will never include one of those: the span is closed immediately after those events are added.

When the kernel replays a transcript to bring a worker up to date, that replay will never see 'save-snapshot' or 'shutdown-worker'. But an external tool which replays a historical span will see them at the end.

The `initialize-worker` event contains `workerOptions` (which includes which type of worker is being used, as well as helper bundle IDs like lockdown and supervisor), as well as the `source.bundleID` for the vat bundle. The `save-snapshot` event results contain the `snapshotID` hash that was generated. The `load-snapshot` event includes the `snapshotID` in a record that could be extended with additional details in the future (like an xsnap version).

The types were improved to make `TranscriptDelivery` be a superset of `VatDeliveryObject`. We also record TranscriptDeliveryResult, which is currently a stripped down subset of VatDeliveryResult (just the "ok" status), plus the save-snapshot hash. In the future, we'll probably record the deterministic subset of metering results (computrons, maybe something about memory allocation).

In the slog, the `heap-snapshot-save` event details now contain `snapshotID` instead of `hash`, to be consistent.

Previously vat-warehouse used `lastVatID` to track which vat received a delivery most recently, and `saveSnapshot()` used that to decide which vat requires a snapshot. This commit changes that path to be more explicit, and removes `lastVatID`.

refs #7199 refs #6770
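As a rough illustration of how an external tool might consume such a span, the sketch below dispatches on the four pseudo-delivery events and treats everything else as a real delivery. The entry shape (`{ d: [type, details], r }`), the stub worker and its method names, and the sample IDs are assumptions made for this example; only the event names and the `workerOptions`, `source.bundleID` and `snapshotID` fields come from the description above.

```javascript
// Hypothetical replay loop for one transcript span (not the real tool).
// Assumed entry shape: { d: [type, details], r: result }.
function replaySpan(entries, worker) {
  if (entries.length === 0) {
    throw Error('empty span');
  }
  // Every span is expected to begin with a worker-start event.
  const startType = entries[0].d[0];
  if (startType !== 'initialize-worker' && startType !== 'load-snapshot') {
    throw Error(`span starts with ${startType}, not a worker-start event`);
  }
  for (const { d } of entries) {
    const [type, details] = d;
    switch (type) {
      case 'initialize-worker':
        worker.start(details.workerOptions, details.source.bundleID);
        break;
      case 'load-snapshot':
        worker.startFromSnapshot(details.snapshotID);
        break;
      case 'save-snapshot':
        // only seen when replaying a historical span; forces a GC
        worker.saveSnapshot();
        break;
      case 'shutdown-worker':
        worker.shutdown();
        break;
      default:
        // anything else is a real delivery
        worker.deliver(d);
    }
  }
}

// Stub worker that just records what it was asked to do.
const calls = [];
const stubWorker = {
  start: (_options, bundleID) => calls.push(`start ${bundleID}`),
  startFromSnapshot: snapshotID => calls.push(`load ${snapshotID}`),
  saveSnapshot: () => calls.push('save'),
  shutdown: () => calls.push('shutdown'),
  deliver: d => calls.push(`deliver ${d[0]}`),
};

// A made-up historical span: it ends with 'save-snapshot'.
const historicalSpan = [
  { d: ['load-snapshot', { snapshotID: 'abc123' }], r: { status: 'ok' } },
  { d: ['message', 'o+0', {}], r: { status: 'ok' } },
  { d: ['bringOutYourDead'], r: { status: 'ok' } },
  { d: ['save-snapshot'], r: { status: 'ok', snapshotID: 'def456' } },
];
replaySpan(historicalSpan, stubWorker);
console.log(calls);
// [ 'load abc123', 'deliver message', 'deliver bringOutYourDead', 'save' ]
```

Because the span itself says how the worker was started and when a snapshot was written, the tool does not need to reconstruct the snapshot schedule from delivery counts.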

warner · 2023-04-28T18:15:26Z

Both heap snapshot saves (with snapshotID hashes in the results) and computrons spent during delivery (also in the results) are now included in the transcript, thanks to #7484.

We also add load-snapshot transcript entries, whose arguments include the snapshotID. These entries are always included, immediately after the save-snapshot (but in the subsequent transcript span), even though the kernel is free to either continue using the existing worker, or to discard the worker and launch a new one. Thus the kernel's worker-reuse policy is not part of consensus.
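A small sketch of that rollover, under stated assumptions: `makeTranscript`, its methods, the entry shape, and the sample IDs are hypothetical names and values chosen for illustration, not the swing-store API. It only models the rule described above, namely that `save-snapshot` closes the current span and the next span opens with a matching `load-snapshot`, whatever the kernel actually does with the worker.

```javascript
// Hypothetical in-memory model of transcript spans (not the swing-store API).
function makeTranscript(initEntry) {
  // The first span starts with the 'initialize-worker' entry.
  const spans = [[initEntry]];
  const currentSpan = () => spans[spans.length - 1];
  return {
    spans,
    addDelivery(d) {
      currentSpan().push({ d, r: { status: 'ok' } });
    },
    recordSnapshot(snapshotID) {
      // 'save-snapshot' is the last entry of the span being closed...
      currentSpan().push({
        d: ['save-snapshot'],
        r: { status: 'ok', snapshotID },
      });
      // ...and the new span always opens with the matching 'load-snapshot',
      // even if the kernel keeps using the existing worker.
      spans.push([{ d: ['load-snapshot', { snapshotID }], r: { status: 'ok' } }]);
    },
  };
}

const transcript = makeTranscript({
  d: ['initialize-worker', { source: { bundleID: 'b1-example' }, workerOptions: {} }],
  r: { status: 'ok' },
});
transcript.addDelivery(['message', 'o+0', {}]);
transcript.recordSnapshot('snap-1');
transcript.addDelivery(['bringOutYourDead']);

console.log(transcript.spans.map(span => span.map(entry => entry.d[0])));
// [ [ 'initialize-worker', 'message', 'save-snapshot' ],
//   [ 'load-snapshot', 'bringOutYourDead' ] ]
```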

All transcript entries (including results) are folded into the current-span hash, which causes an update to the swing-store export data, which makes them part of consensus.
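The effect of folding entries into a span hash can be sketched with a simple chained SHA-256. The chaining scheme, the use of `JSON.stringify` as the serialization, and the `metering.computrons` field in the results are assumptions for illustration; the actual swing-store hashing may differ in every detail.

```javascript
const { createHash } = require('node:crypto');

const sha256 = text => createHash('sha256').update(text).digest('hex');

// Hypothetical rolling hash: each entry (delivery and results) is folded
// into the previous hash, so any difference changes the final value.
// Assumption: JSON.stringify yields a stable serialization for these entries.
function hashSpan(entries) {
  let hash = sha256('');
  for (const entry of entries) {
    hash = sha256(`${hash}\n${JSON.stringify(entry)}`);
  }
  return hash;
}

// Made-up span; the metering shape is an assumption.
const span = [
  { d: ['load-snapshot', { snapshotID: 'abc123' }], r: { status: 'ok' } },
  {
    d: ['message', 'o+0', {}],
    r: { status: 'ok', metering: { computrons: 1000 } },
  },
];

// Same span, but one delivery used a different number of computrons.
const altered = JSON.parse(JSON.stringify(span));
altered[1].r.metering.computrons = 1001;

console.log(hashSpan(span) === hashSpan(span)); // true
console.log(hashSpan(span) === hashSpan(altered)); // false
```

Since the results are hashed along with the deliveries, a validator whose delivery spent a different number of computrons, or produced a different snapshot hash, would end up with a different span hash.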

Declaring victory on this one.

mhofman added the enhancement, SwingSet, and vaults_triage labels Jan 10, 2023

This was referenced Jan 13, 2023

snapshot / BOYD interval based on computrons #6786

Open

Liveslots finalizers run metered #6795

Open

warner self-assigned this Feb 8, 2023

mhofman mentioned this issue Mar 21, 2023

Capture start/upgrade information in transcript #7199

Closed

gibson042 self-assigned this Mar 31, 2023

mhofman mentioned this issue Apr 10, 2023

Better represent heap cost in run policy #7373

Open

warner unassigned gibson042 Apr 16, 2023

warner mentioned this issue Apr 23, 2023

add transcript events: init, snapshot save/load, shutdown #7484

Merged

ivanlei added this to the Vaults EVP milestone Apr 27, 2023

warner closed this as completed Apr 28, 2023
