Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

deleting a virtual weakmap fails to retire imported vrefs used as keys #7355

Closed
warner opened this issue Apr 7, 2023 · 4 comments
Closed
Assignees
Labels
bug Something isn't working liveslots requires vat-upgrade to deploy changes SwingSet package: SwingSet

Comments

@warner
Copy link
Member

warner commented Apr 7, 2023

Describe the bug

While investigating #7353 I realized that we have another problem in the liveslots GC code. The setup is:

  • use makeScalarBigWeakMapStore() to construct a virtual WeakMap A
    • we hold the Representative in RAM for now
  • use an imported Presence B as a key, with some arbitrary value
    • A is said to "recognize" B, i.e. A is a member of the recognizer set for B
    • liveslots uses a vom.ir.${vrefB}|${collectionID} vatstore key to remember the recognition reference
  • now drop Presence B from RAM, and do a BOYD
    • liveslots should send a syscall.dropImports, because it no longer has a strong reference to B anywhere (no RAM pillar, it's an import so there's no export pillar, and it wasn't stored strongly in vdata)
  • pretend the exporting vat does not delete the object just because our vat under test dropped its import
    • i.e. do not send a dispatch.retireImports in
    • now A continues to recognize B
  • now drop the A Representative from RAM, and do a BOYD
    • this should perform vatstore syscalls to delete all the contents and metadata of the now-deleted collection
    • when it deletes the key B, it removes the recognition reference

Removing the last recognition reference should emit a syscall.retireImports for B's vref.

The bug is that we don't send that retireImports.

Liveslots has a possiblyRetiredSet, with an addToPossiblyRetiredSet() function to add vrefs to it. This is called from only one place, removeRecognizableVref(), which only uses it on imported Presences and local virtual/durable Representatives, and only if the recognizer is a VirtualObjectAwareWeakMap/Set (i.e. ! recognizerIsVirtual):

  function removeRecognizableVref(vref, recognizer, recognizerIsVirtual) {
    const { type, allocatedByVat, virtual, durable } = parseVatSlot(vref);
    if (type === 'object' && (!allocatedByVat || virtual || durable)) {
      if (recognizerIsVirtual) {
        syscall.vatstoreDelete(`vom.ir.${vref}|${recognizer}`);
      } else {
        const recognizerSet = vrefRecognizers.get(vref);
        assert(recognizerSet && recognizerSet.has(recognizer));
        recognizerSet.delete(recognizer);
        if (recognizerSet.size === 0) {
          vrefRecognizers.delete(vref);
          if (!allocatedByVat) {
            addToPossiblyRetiredSet(vref);
          }
        }
      }
    }
  }

scanForDeadObjects() uses this set, with some checks against getValForSlot, deadSet, and vrm.isVrefRecognizable(), to decide whether these vrefs should be retired or not. This clause only appears to act upon imported Presence vrefs.

      for (const vref of possiblyRetiredSet) {
        // eslint-disable-next-line no-use-before-define
        if (!getValForSlot(vref) && !deadSet.has(vref)) {
          // Don't retire things that haven't yet made the transition to dead,
          // i.e., always drop before retiring
          // eslint-disable-next-line no-use-before-define
          if (!vrm.isVrefRecognizable(vref)) {
            importsToRetire.add(vref);
          }
        }
      }
      possiblyRetiredSet.clear();

A later clause offers a second path into importsToRetire: if a Presence is in deadSet (it's RAM pillar has dropped and stayed down), and if the vref is not reachable by vdata, and it is also not recognizable by vdata, then the vref will be retired:

        } else {
          // Presence: send dropImport unless reachable by VOM
          // eslint-disable-next-line no-lonely-if, no-use-before-define
          if (!vrm.isPresenceReachable(baseRef)) {
            importsToDrop.add(baseRef);
            // eslint-disable-next-line no-use-before-define
            if (!vrm.isVrefRecognizable(baseRef)) {
              // for presences, baseRef === vref
              importsToRetire.add(baseRef);
            }
          }

This second clause provides a fast path for the most common case: the imported Presence was used in neither Maps nor WeakMaps: it only had a RAM pillar, so our vat's ability to recognize it disappears along with the Presence, and we should emit both dropImports and retireImports in the same BOYD. It also covers another common case, where it was used in only a WeakMap, so losing the RAM pillar mandates a dropImport, but we continue to recognize it, and wait for the exporting vat to retire it with a dispatch.retireImports.

The case that's missing is when the recognizability is lost because our recognizer is deleted, after the Presence has gone.

Fixes

I'm still trying to think about the right way to fix this. I'm pretty sure that removeRecognizableVref() should be calling addToPossiblyRetiredSet() for both the recognizerIsVirtual and ! recognizerIsVirtual cases: when we delete the vom.ir. key, we should be scheduling a check at the next BOYD, in case that was the only remaining key. And then scanForDeadObjects should be doing an independent vrm.isVrefRecognizable() check on everything in possibleRetiredSet.

What I'm not yet clear on is how the first scanForDeadObjects clause should be changed.

"reachable" and "recognizable" are two of three possible states: it's really one of "no reference", "recognizable but not reachable", and "both recognizable and reachable". So we've got a reference graph whose edges come in two colors, or we have one (smaller) reference graph overlaid on top of a second, depending on your way of thinking.

Virtual objects have three possible pillars (RAM/Representative, export, vdata), of which both export and vdata come in both reachable and recognizable flavors (the JS engine handles reachability of Representatives, and we've commandeered everything userspace might get that could let them express mere recognizability). Imported Presences lack the export pillar, but the export still has a "not retired yet" attribute that we must pay attention to. Local/precious/ephemeral Remotables are held strongly when a virtual collection is holding their vref, so they generally aren't tracked with the notion of pillars.

So we kind of have two separate state machines, one for virtual objects (where a Representative is the RAM pillar), and a second for imports (where a Presence serves that role). But we'd like to share as much of their mechanics as we can, like possibleRetiredSet.

@warner warner added bug Something isn't working SwingSet package: SwingSet liveslots requires vat-upgrade to deploy changes labels Apr 7, 2023
@warner warner self-assigned this Apr 7, 2023
warner added a commit that referenced this issue Apr 7, 2023
@warner
Copy link
Member Author

warner commented Apr 7, 2023

I've added a failing test in 7355-presence-retirement-bug .

@warner
Copy link
Member Author

warner commented Apr 7, 2023

If this bug is present, and the vat fails to send its syscall.retireImports, then sooner or later the kernel will send a dispatch.retireImports. I've udpated my branch to confirm that this doesn't cause any obvious problems: I think retiring an unknown vref is just ignored by liveslots.

Given that, I think we can defer fixing this for a while. The missing retireImports would have removed a c-list entry, and maybe a kernel object-table entry (if the object were unreachable by all vats but still had identity in the exporting vat). But it wouldn't have triggered any cross-vat GC: the buggy vat isn't holding a strong reference to the import. And it wouldn't have triggered any cleanup within the buggy vat itself: we only get into this situation if the weak collection is deleted, which means we've already deleted the value that the non-retired key was pointing to.

warner added a commit that referenced this issue Apr 9, 2023
warner added a commit that referenced this issue Apr 9, 2023
warner added a commit that referenced this issue Aug 28, 2024
Liveslots was suffering from the following bugs:

* vrefs weren't properly encoded into DB keys and "encodePassable"
  keys when deleting collections (such that `isEncodedRemotable()`
  always answered false)

* recognition records (vom.ir. keys) weren't created or removed for
  Remotable-style keys in weak collections

* Remotable-style vrefs were not submitted for retirement processing
  when deleting a weak collection

As a result:

* Invoking the `.clear()` method on a virtual/durable collection, or
  dereferencing the collection itself (and allowing it to be garbage
  collected), did not free any Remotables, virtual/durable
  objects (Representatives), or imported Presences used as keys. This
  could cause data in other weak collections (or other vats) to be
  retained longer than necessary.

* Allowing a weak virtual/durable collection to be garbage collected
  did not inform the kernel that the vat can no longer recognize the
  Remotable/Representative/Presence-style keys, consuming c-list
  entries longer than necessary.

* `collection.clear()` left the "ordinal-assignment"` records in the
  vatstore, causing subsequent .has() to incorrectly return true, and
  .init() to throw a "already registered" error, as well as consuming
  DB space longer than necessary.

* retiring (deleting) a Remotable used as a key in a weak
  virtual/durable collection did not free the corresponding value,
  causing data (in local and/or remote vats) to be retained longer
  than necessary

This commit fixes those bugs, which fixes the immediate cause of
issues #8756 , #7355 , and #9956 . As a change to liveslots, full
deployment requires both restarting the kernel with this new
code, *and* triggering a vat upgrade of all vats. Once that is done,
no new collection entries will suffer the problems listed above.

However, this commit makes no attempt to remediate any existing data
corruption. See #8759 for plans to build a tool that can audit the
virtual-data reference graph and detect the consequences of these bugs
in pre-existing kernel databases, and see the issues listed above for
notes on efforts to build remediation tools.
warner added a commit that referenced this issue Aug 28, 2024
Liveslots was suffering from the following bugs:

* vrefs weren't properly encoded into DB keys and "encodePassable"
  keys when deleting collections (such that `isEncodedRemotable()`
  always answered false, which skipped the refcount processing of
  keys, and removal of the "ordinal-assignment" record)

* recognition records (vom.ir. keys) weren't created or removed for
  Remotable-style keys in weak collections

* Presence-style vrefs were not submitted for retirement processing
  when deleting a weak collection

As a result:

* `collection.clear()` left the "ordinal-assignment"` records in the
  vatstore, causing subsequent .has(oldkey) to incorrectly return
  true, and .init(oldkey, newvalue) to throw a "already registered"
  error, as well as consuming DB space longer than necessary.

* Invoking the `.clear()` method on a virtual/durable collection, or
  dereferencing the collection itself (and allowing it to be garbage
  collected), did not free any Remotables, virtual/durable
  objects (Representatives), or imported Presences used as keys. This
  could cause data in other weak collections (or other vats) to be
  retained longer than necessary.

* retiring (deleting) a Remotable used as a key in a weak
  virtual/durable collection did not free the corresponding value,
  causing data (in local and/or remote vats) to be retained longer
  than necessary

* Allowing a weak virtual/durable collection to be garbage collected
  did not inform the kernel that the vat can no longer recognize the
  Presence-style keys, consuming c-list entries longer than necessary.

This commit fixes those bugs, which fixes the immediate cause of
issues #8756 , #7355 , and #9956 . As a change to liveslots, full
deployment requires both restarting the kernel with this new
code, *and* triggering a vat upgrade of all vats. Once that is done,
no new collection entries will suffer the problems listed above.

However, this commit makes no attempt to remediate any existing data
corruption. See #8759 for plans to build a tool that can audit the
virtual-data reference graph and detect the consequences of these bugs
in pre-existing kernel databases, and see the issues listed above for
notes on efforts to build remediation tools.
@warner
Copy link
Member Author

warner commented Aug 28, 2024

PR #9961 will fix this problem. I'll close this ticket once that PR lands.

The PR doesn't attempt to remediate any pre-existing consequences of this bug. Imported ("Presence-style") vrefs that were not retired will still not be retired. Fortunately the consequences of this are small: the kernel will maintain a "state: RECOGNIZABLE" c-list entry for the importing/recognizing vat (two kvStore keys), which could otherwise be dropped. If this vat was the only one still recognizing the object, the kernel will also maintain an object-table record for it (another two kvStore keys), and will maintain a c-list entry for the exporting vat (yet another two keys). This bookkeeping data is the only unnecessary storage consumption: no userspace-visible data will be retained by the RECOGNIZABLE state.

The #8759 tool (once it exists) will let you audit an existing kernel DB for mismatches between c-list and refcount, which may help determine the scale of the problem. Remediation from within a vat seems unlikely to be cheap enough to perform (it amounts to a mark-and-sweep refcount audit, plus a new vat-kernel API to learn the import/c-list status of a vref). But we're considering an approach that would allow an external party to submit alleged refcount fixes, to notify a vat of a subset of the vrefs that are worthy of analysis and updates. That would allow the expensive work to be done off-chain, and the corrections submitted as transactions, and safely verified before applying the recommended changes.

Also note that once a vat is deleted, its state will no longer be corrupted. Replacing a vat with a new clean one is another form of remediation.

@warner
Copy link
Member Author

warner commented Aug 28, 2024

Notes for the DB scanner tool:

  • check the c-list for vrefs in the RECOGNIZABLE state
  • for each vref, look for all the virtual/durable "recognition records", vc.ir.${vref}|${collectionID}
  • the vrefRecognizers Map holds ephemeral recognition records: recognizable vrefs are keys of this Map
    • the best way I can think of to find these is to parse the latest heap snapshot, get a list of all its String chunks, and see if the vrefs are present in that table
    • of course, the snapshot may be several hundred deliveries out of date, and we don't have a way to query the current RAM image of the worker
    • so also check the transcript for deliveries made since that heap snapshot
    • if the vref was not re-introduced in delivery arguments, the vref will not be in the current heap
    • pay attention to the difference between vref and baseRef: if the recognized vref is facet-1 of a multi-facet cohort, and a message arrives which references facet-2, methods on facet-2 can get access to facet-1 (or any other facet), and add it to a WeakStore
    • so the tool should signal a concern if the deliveries include any facet vref that shares a baseRef with the facet vref that appears in the c-list in the RECOGNIZABLE state
  • see the virtualReferences.js comment on vrefRecognizers for details about the data structures and DB records involved

warner added a commit that referenced this issue Aug 29, 2024
Liveslots was suffering from the following bugs:

* vrefs weren't properly encoded into DB keys and "encodePassable"
  keys when deleting collections (such that `isEncodedRemotable()`
  always answered false, which skipped the refcount processing of
  keys, and removal of the "ordinal-assignment" record)

* recognition records (vom.ir. keys) weren't created or removed for
  Remotable-style keys in weak collections

* Presence-style vrefs were not submitted for retirement processing
  when deleting a weak collection

As a result:

* `collection.clear()` left the "ordinal-assignment"` records in the
  vatstore, causing subsequent .has(oldkey) to incorrectly return
  true, and .init(oldkey, newvalue) to throw a "already registered"
  error, as well as consuming DB space longer than necessary.

* Invoking the `.clear()` method on a virtual/durable collection, or
  dereferencing the collection itself (and allowing it to be garbage
  collected), did not free any Remotables, virtual/durable
  objects (Representatives), or imported Presences used as keys. This
  could cause data in other weak collections (or other vats) to be
  retained longer than necessary.

* retiring (deleting) a Remotable used as a key in a weak
  virtual/durable collection did not free the corresponding value,
  causing data (in local and/or remote vats) to be retained longer
  than necessary

* Allowing a weak virtual/durable collection to be garbage collected
  did not inform the kernel that the vat can no longer recognize the
  Presence-style keys, consuming c-list entries longer than necessary.

This commit fixes those bugs, which fixes the immediate cause of:

* fixes #8756
* fixes #7355
* fixes #9956

As a change to liveslots, full deployment requires both restarting the
kernel with this new code, *and* triggering a vat upgrade of all
vats. Once that is done, no new collection entries will suffer the
problems listed above.

However, this commit makes no attempt to remediate any existing data
corruption. See #8759 for plans to build a tool that can audit the
virtual-data reference graph and detect the consequences of these bugs
in pre-existing kernel databases, and see the issues listed above for
notes on efforts to build remediation tools.

This commit marks the new tests as expected to pass again. It adds one
new (failing) test to demonstrate the lack of remediation code.

Thanks to @gibson042 for recommendations.
warner added a commit that referenced this issue Aug 30, 2024
Liveslots was suffering from the following bugs:

* vrefs weren't properly encoded into DB keys and "encodePassable"
  keys when deleting collections (such that `isEncodedRemotable()`
  always answered false, which skipped the refcount processing of
  keys, and removal of the "ordinal-assignment" record)

* recognition records (vom.ir. keys) weren't created or removed for
  Remotable-style keys in weak collections

* Presence-style vrefs were not submitted for retirement processing
  when deleting a weak collection

As a result:

* `collection.clear()` left the "ordinal-assignment"` records in the
  vatstore, causing subsequent .has(oldkey) to incorrectly return
  true, and .init(oldkey, newvalue) to throw a "already registered"
  error, as well as consuming DB space longer than necessary.

* Invoking the `.clear()` method on a virtual/durable collection, or
  dereferencing the collection itself (and allowing it to be garbage
  collected), did not free any Remotables, virtual/durable
  objects (Representatives), or imported Presences used as keys. This
  could cause data in other weak collections (or other vats) to be
  retained longer than necessary.

* retiring (deleting) a Remotable used as a key in a weak
  virtual/durable collection did not free the corresponding value,
  causing data (in local and/or remote vats) to be retained longer
  than necessary

* Allowing a weak virtual/durable collection to be garbage collected
  did not inform the kernel that the vat can no longer recognize the
  Presence-style keys, consuming c-list entries longer than necessary.

This commit fixes those bugs, which fixes the immediate cause of:

* fixes #8756
* fixes #7355
* fixes #9956

As a change to liveslots, full deployment requires both restarting the
kernel with this new code, *and* triggering a vat upgrade of all
vats. Once that is done, no new collection entries will suffer the
problems listed above.

However, this commit makes no attempt to remediate any existing data
corruption. See #8759 for plans to build a tool that can audit the
virtual-data reference graph and detect the consequences of these bugs
in pre-existing kernel databases, and see the issues listed above for
notes on efforts to build remediation tools.

This commit marks the new tests as expected to pass again. It adds one
new (failing) test to demonstrate the lack of remediation code.

Thanks to @gibson042 for recommendations.
warner added a commit that referenced this issue Aug 30, 2024
Liveslots was suffering from the following bugs:

* vrefs weren't properly encoded into DB keys and "encodePassable"
  keys when deleting collections (such that `isEncodedRemotable()`
  always answered false, which skipped the refcount processing of
  keys, and removal of the "ordinal-assignment" record)

* recognition records (vom.ir. keys) weren't created or removed for
  Remotable-style keys in weak collections

* Presence-style vrefs were not submitted for retirement processing
  when deleting a weak collection

As a result:

* `collection.clear()` left the "ordinal-assignment"` records in the
  vatstore, causing subsequent .has(oldkey) to incorrectly return
  true, and .init(oldkey, newvalue) to throw a "already registered"
  error, as well as consuming DB space longer than necessary.

* Invoking the `.clear()` method on a virtual/durable collection, or
  dereferencing the collection itself (and allowing it to be garbage
  collected), did not free any Remotables, virtual/durable
  objects (Representatives), or imported Presences used as keys. This
  could cause data in other weak collections (or other vats) to be
  retained longer than necessary.

* retiring (deleting) a Remotable used as a key in a weak
  virtual/durable collection did not free the corresponding value,
  causing data (in local and/or remote vats) to be retained longer
  than necessary

* Allowing a weak virtual/durable collection to be garbage collected
  did not inform the kernel that the vat can no longer recognize the
  Presence-style keys, consuming c-list entries longer than necessary.

This commit fixes those bugs, which fixes the immediate cause of:

* fixes #8756
* fixes #7355
* fixes #9956

As a change to liveslots, full deployment requires both restarting the
kernel with this new code, *and* triggering a vat upgrade of all
vats. Once that is done, no new collection entries will suffer the
problems listed above.

However, this commit makes no attempt to remediate any existing data
corruption. See #8759 for plans to build a tool that can audit the
virtual-data reference graph and detect the consequences of these bugs
in pre-existing kernel databases, and see the issues listed above for
notes on efforts to build remediation tools.

This commit marks the new tests as expected to pass again. It adds one
new (failing) test to demonstrate the lack of remediation code.

Thanks to @gibson042 for recommendations.
warner added a commit that referenced this issue Aug 31, 2024
Liveslots was suffering from the following bugs:

* vrefs weren't properly encoded into DB keys and "encodePassable"
  keys when deleting collections (such that `isEncodedRemotable()`
  always answered false, which skipped the refcount processing of
  keys, and removal of the "ordinal-assignment" record)

* recognition records (vom.ir. keys) weren't created or removed for
  Remotable-style keys in weak collections

* Presence-style vrefs were not submitted for retirement processing
  when deleting a weak collection

As a result:

* `collection.clear()` left the "ordinal-assignment"` records in the
  vatstore, causing subsequent .has(oldkey) to incorrectly return
  true, and .init(oldkey, newvalue) to throw a "already registered"
  error, as well as consuming DB space longer than necessary.

* Invoking the `.clear()` method on a virtual/durable collection, or
  dereferencing the collection itself (and allowing it to be garbage
  collected), did not free any Remotables, virtual/durable
  objects (Representatives), or imported Presences used as keys. This
  could cause data in other weak collections (or other vats) to be
  retained longer than necessary.

* retiring (deleting) a Remotable used as a key in a weak
  virtual/durable collection did not free the corresponding value,
  causing data (in local and/or remote vats) to be retained longer
  than necessary

* Allowing a weak virtual/durable collection to be garbage collected
  did not inform the kernel that the vat can no longer recognize the
  Presence-style keys, consuming c-list entries longer than necessary.

This commit fixes those bugs, which fixes the immediate cause of:

* fixes #8756
* fixes #7355
* fixes #9956

As a change to liveslots, full deployment requires both restarting the
kernel with this new code, *and* triggering a vat upgrade of all
vats. Once that is done, no new collection entries will suffer the
problems listed above.

However, this commit makes no attempt to remediate any existing data
corruption. See #8759 for plans to build a tool that can audit the
virtual-data reference graph and detect the consequences of these bugs
in pre-existing kernel databases, and see the issues listed above for
notes on efforts to build remediation tools.

This commit marks the new tests as expected to pass again. It adds one
new (failing) test to demonstrate the lack of remediation code.

Thanks to @mhofman and @gibson042 for recommendations.
@mergify mergify bot closed this as completed in 97e81f1 Aug 31, 2024
mergify bot added a commit that referenced this issue Aug 31, 2024
Liveslots was suffering from the following bugs:

* vrefs weren't properly encoded into DB keys and "encodePassable"
  keys when deleting collections (such that `isEncodedRemotable()`
  always answered false, which skipped the refcount processing of
  keys, and removal of the "ordinal-assignment" record)

* recognition records (vom.ir. keys) weren't created or removed for
  Remotable-style keys in weak collections

* Presence-style vrefs were not submitted for retirement processing
  when deleting a weak collection

As a result:

* `collection.clear()` left the "ordinal-assignment"` records in the
  vatstore, causing subsequent .has(oldkey) to incorrectly return
  true, and .init(oldkey, newvalue) to throw a "already registered"
  error, as well as consuming DB space longer than necessary.

* Invoking the `.clear()` method on a virtual/durable collection, or
  dereferencing the collection itself (and allowing it to be garbage
  collected), did not free any Remotables, virtual/durable
  objects (Representatives), or imported Presences used as keys. This
  could cause data in other weak collections (or other vats) to be
  retained longer than necessary.

* retiring (deleting) a Remotable used as a key in a weak
  virtual/durable collection did not free the corresponding value,
  causing data (in local and/or remote vats) to be retained longer
  than necessary

* Allowing a weak virtual/durable collection to be garbage collected
  did not inform the kernel that the vat can no longer recognize the
  Presence-style keys, consuming c-list entries longer than necessary.

This commit fixes those bugs, which fixes the immediate cause of
issues #8756 , #7355 , and #9956 . As a change to liveslots, full
deployment requires both restarting the kernel with this new
code, *and* triggering a vat upgrade of all vats. Once that is done,
no new collection entries will suffer the problems listed above.

However, this commit makes no attempt to remediate any existing data
corruption. See #8759 for plans to build a tool that can audit the
virtual-data reference graph and detect the consequences of these bugs
in pre-existing kernel databases, and see the issues listed above for
notes on efforts to build remediation tools.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working liveslots requires vat-upgrade to deploy changes SwingSet package: SwingSet
Projects
None yet
Development

No branches or pull requests

1 participant