-
Notifications
You must be signed in to change notification settings - Fork 215
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Perform snapshots only after BOYD #7504
Comments
Extracted from #6786 |
I've implemented a form of this, but I changed the causality. Instead of inhibiting snapshots to only happen after BOYD, I change the snapshot logic to always perform a BOYD first. So snapshots will happen according to the usual The limitations of my PR:
|
This changes the snapshot scheduling logic to be more consistent. We still use `snapshotInitial` to trigger a snapshot shortly after worker initialization, and `snapshotInterval` to trigger periodic ones after that. However the previous code compared `snapshotInitial` to the absolute deliveryNum, which meant it only applied to the first incarnation, and would not attempt to take a snapshot shortly after upgrade, leaving the kernel vulnerable to replaying the long `startVat` delivery for a larger window than we intended. And `snapshotInterval` was compared against the difference between the latest transcript and the latest snapshot, which changed with the addition of the load-worker pseudo-entry. The new code uses `snapshotInitial` whenever there is not an existing snapshot (so the first span of *all* incarnations), and compares it against the length of the current span (so it includes all the pseudo-events). `snapshotInterval` is also compared against the length of the current span. The result is simpler and more predictable set of rules: * in the first span of each incarnation, trigger a snapshot once we have at least `snapshotInterval` entries * in all other spans, trigger once we have at least `snapshotInterval` In addition, when triggering a snapshot, we perform a BringOutYourDead delivery before asking the worker to save a snapshot. This gives us one last chance to shake out any garbage (making the snapshot as small as possible), and reduces the variation we might see forced GC that happens during snapshot write (any FinalizationRegistry callbacks should get run during the BOYD, not the save-snapshot). closes #7533 closes #7504
This changes the snapshot scheduling logic to be more consistent. We still use `snapshotInitial` to trigger a snapshot shortly after worker initialization, and `snapshotInterval` to trigger periodic ones after that. However the previous code compared `snapshotInitial` to the absolute deliveryNum, which meant it only applied to the first incarnation, and would not attempt to take a snapshot shortly after upgrade, leaving the kernel vulnerable to replaying the long `startVat` delivery for a larger window than we intended. And `snapshotInterval` was compared against the difference between the latest transcript and the latest snapshot, which changed with the addition of the load-worker pseudo-entry. The new code uses `snapshotInitial` whenever there is not an existing snapshot (so the first span of *all* incarnations), and compares it against the length of the current span (so it includes all the pseudo-events). `snapshotInterval` is also compared against the length of the current span. The result is simpler and more predictable set of rules: * in the first span of each incarnation, trigger a snapshot once we have at least `snapshotInterval` entries * in all other spans, trigger once we have at least `snapshotInterval` In addition, when triggering a snapshot, we perform a BringOutYourDead delivery before asking the worker to save a snapshot. This gives us one last chance to shake out any garbage (making the snapshot as small as possible), and reduces the variation we might see forced GC that happens during snapshot write (any FinalizationRegistry callbacks should get run during the BOYD, not the save-snapshot). closes #7553 closes #7504
This changes the snapshot scheduling logic to be more consistent. We still use `snapshotInitial` to trigger a snapshot shortly after worker initialization, and `snapshotInterval` to trigger periodic ones after that. However the previous code compared `snapshotInitial` to the absolute deliveryNum, which meant it only applied to the first incarnation, and would not attempt to take a snapshot shortly after upgrade, leaving the kernel vulnerable to replaying the long `startVat` delivery for a larger window than we intended. And `snapshotInterval` was compared against the difference between the latest transcript and the latest snapshot, which changed with the addition of the load-worker pseudo-entry. The new code uses `snapshotInitial` whenever there is not an existing snapshot (so the first span of *all* incarnations), and compares it against the length of the current span (so it includes all the pseudo-events). `snapshotInterval` is also compared against the length of the current span. The result is simpler and more predictable set of rules: * in the first span of each incarnation, trigger a snapshot once we have at least `snapshotInterval` entries * in all other spans, trigger once we have at least `snapshotInterval` In addition, when triggering a snapshot, we perform a BringOutYourDead delivery before asking the worker to save a snapshot. This gives us one last chance to shake out any garbage (making the snapshot as small as possible), and reduces the variation we might see forced GC that happens during snapshot write (any FinalizationRegistry callbacks should get run during the BOYD, not the save-snapshot). closes #7553 closes #7504
This changes the snapshot scheduling logic to be more consistent. We still use `snapshotInitial` to trigger a snapshot shortly after worker initialization, and `snapshotInterval` to trigger periodic ones after that. However the previous code compared `snapshotInitial` to the absolute deliveryNum, which meant it only applied to the first incarnation, and would not attempt to take a snapshot shortly after upgrade, leaving the kernel vulnerable to replaying the long `startVat` delivery for a larger window than we intended. And `snapshotInterval` was compared against the difference between the latest transcript and the latest snapshot, which changed with the addition of the load-worker pseudo-entry. The new code uses `snapshotInitial` whenever there is not an existing snapshot (so the first span of *all* incarnations), and compares it against the length of the current span (so it includes all the pseudo-events). `snapshotInterval` is also compared against the length of the current span. The result is simpler and more predictable set of rules: * in the first span of each incarnation, trigger a snapshot once we have at least `snapshotInterval` entries * in all other spans, trigger once we have at least `snapshotInterval` In addition, when triggering a snapshot, we perform a BringOutYourDead delivery before asking the worker to save a snapshot. This gives us one last chance to shake out any garbage (making the snapshot as small as possible), and reduces the variation we might see forced GC that happens during snapshot write (any FinalizationRegistry callbacks should get run during the BOYD, not the save-snapshot). closes #7553 closes #7504
This changes the snapshot scheduling logic to be more consistent. We still use `snapshotInitial` to trigger a snapshot shortly after worker initialization, and `snapshotInterval` to trigger periodic ones after that. However the previous code compared `snapshotInitial` to the absolute deliveryNum, which meant it only applied to the first incarnation, and would not attempt to take a snapshot shortly after upgrade, leaving the kernel vulnerable to replaying the long `startVat` delivery for a larger window than we intended. And `snapshotInterval` was compared against the difference between the latest transcript and the latest snapshot, which changed with the addition of the load-worker pseudo-entry. The new code uses `snapshotInitial` whenever there is not an existing snapshot (so the first span of *all* incarnations), and compares it against the length of the current span (so it includes all the pseudo-events). `snapshotInterval` is also compared against the length of the current span. The result is simpler and more predictable set of rules: * in the first span of each incarnation, trigger a snapshot once we have at least `snapshotInterval` entries * in all other spans, trigger once we have at least `snapshotInterval` In addition, when triggering a snapshot, we perform a BringOutYourDead delivery before asking the worker to save a snapshot. This gives us one last chance to shake out any garbage (making the snapshot as small as possible), and reduces the variation we might see forced GC that happens during snapshot write (any FinalizationRegistry callbacks should get run during the BOYD, not the save-snapshot). closes #7553 closes #7504
What is the Problem Being Solved?
Moddable added a feature to only clean out WeakRef during organic gc to mitigate our known issue where liveslots exposes its sensitivity to gc in syscalls (#6784).
However one case where a forced full gc is performed is during a snapshot. The problem is that if we perform an XS upgrade that has a difference in allocation behavior, it may cause different syscalls when replaying the transcript to get the vat online if some representatives or presences were released by the program between the last BOYD and the time the snapshot was taken after.
Description of the Design
Remove snapshot scheduling, only perform snapshot after delivering a bringOutYourDead.
Possibly force a bringOutYourDead immediately after vat start to create a snapshot of the initialized vat like we have today.
Security Considerations
None
Scaling Considerations
None
Test Plan
TBD
The text was updated successfully, but these errors were encountered: