Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

using Remotable as weak collection key causes value to be pinned forever #9956

Closed
warner opened this issue Aug 24, 2024 · 5 comments
Closed
Assignees
Labels
bug Something isn't working liveslots requires vat-upgrade to deploy changes SwingSet package: SwingSet

Comments

@warner
Copy link
Member

warner commented Aug 24, 2024

While investigating #9939 I found a second bug:

  • create rem = Far('iface', {}), the kind of ephemeral referenceable object that liveslots calls a "Remotable", with a vref like o+1
  • create a virtual weak Map-like collection, wms = makeScalarBigWeakMapStore()
  • define some data, perhaps with an imported Presence
  • store it under the Remotable key with wms.init(rem, data)
  • now let rem call out of scope, and wait for a BOYD to garbage-collect it

The expected behavior is that the store data is collected too. However, because of the bug, the data remains alive. In fact, the entire key+value entry remains in the weak collection (indexed by the Remotable's o+1 vref). That will keep a Presence alive, which may keep data alive in other vats.

I think the problem is in addRecognizableValue. Note that the (type === 'object' && (!allocatedByVat || virtual || durable)) condition accepts Presences (which are not allocatedByVat) and virtual/durable Representatives, but excludes Remotables. And there is no else clause afterwards. So addRecognizableValue(remotable) will just be ignored.

function addRecognizableValue(value, recognizer, recognizerIsVirtual) {
const vref = getSlotForVal(value);
if (vref) {
const { type, allocatedByVat, virtual, durable } = parseVatSlot(vref);
if (type === 'object' && (!allocatedByVat || virtual || durable)) {

That means we don't record any pointers from the Remotable's vref to the collection whose entry ought to be deleted once the vref is retired. So the entry will be left forever, or at least until the entire collection is cleared or deleted.

The function's comment reveals that addRecognizableValue was written to support voAwareWeakMap/Set, in which any Remotables used as keys are stored strongly, as the key of a Map. When we started calling addRecognizableValue from the virtual-weak-collection code, we didn't revisit its limitations.

* collections. Specifically, the vrefs correspond to imported
* Presences or virtual-object Representatives (Remotables do not
* participate: they are keyed by the actual Remotable object, not
* its vref). The collections are either a VirtualObjectAwareWeakMap
* or a VirtualObjectAwareWeakSet. We remove the entry when the key

Impact

This isn't a safety issue, because by the time this bug is triggered, the key is already unreachable (by definition), so the matching value is unretrievable: no userspace code can generate the Remotable to do a .get(), and of course weak collections are non-enumerable.

But it is a performance issue, specifically a disk-space consumption problem. As a virtual (rather than ephemeral) collection, the values are stored in the vatstore database, and we'll use more DB space than we're supposed to (but since the values must also be virtual, no RAM will be consumed).

The bug cannot occur for a fully-durable collection, because we refuse to accept non-durable keys (or values), and Remotables are not durable. It doesn't appear for strongly-keyed collections, of course, because the key is held strongly, and won't go away until someone explicitly deletes it from the collection (or clears/deletes the whole collection). It only affects merely-virtual weak-keyed collections, so only makeScalarBigWeakMapStore and makeScalarBigWeakSetStore. And only the Map form can entrain remote objects.

We don't use merely-virtual stores very much (when we go virtual, we go durable), so the immediate impact is probably low.

Likely Fix

I think we'll need to add vom.ir.${vref} keys for these o+1 -format Remotable vrefs. That will require changes to scanForDeadObjects to make sure we're reacting to their retirement properly.

@warner warner added bug Something isn't working SwingSet package: SwingSet liveslots requires vat-upgrade to deploy changes labels Aug 24, 2024
warner added a commit that referenced this issue Aug 24, 2024
@warner
Copy link
Member Author

warner commented Aug 24, 2024

@warner
Copy link
Member Author

warner commented Aug 24, 2024

The remediation will be for a BOYD call to walk all weak collections, scanning their ordinal table for o+NN-format vrefs, checking valForSlot(vref) on each, and if the Remotable is missing, remove that entry from the collection (which will free up its value). It should not add the vref to possiblyRetiredSet.

Liveslots knows exactly which KindIDs are for weak collections (there are only four of them), so the scan will start with their ordinal table (prefix vc.${collectionID}.|o+), and exclude the Representatives (o+dNN or o+vNN). So vc.${collectionID}.|o+0 is the inclusive beginning, and vc.${collectionID}.|o+: is the exclusive end. That costs one vatstoreGetNextKey for each of the four collections, plus another for each o+NN vref used as a key. That should be pretty cheap, as long as this isn't particularly widespread.

I think this will count all the occurrences:

% sqlite3 swingstore.sqlite 'select * from kvStore' >all-kvStore.txt
% cat all-kvStore.txt |grep -e '^v\d\+\.vs\.vc\.\d\+\.|o+\d' >bug9956.txt

but it's showing zero hits for a recent mainnet DB snapshot. That's entirely possible, we mostly use durable collections, not merely-virtual ones, so we aren't in the habit of using non-durables as keys.

The remediation should also set a flag in the vatstore (maybe vom.version?) to indicate two things:

  • all remediation has been completed
  • all future Remotable keys will be given vom.ir.${vref}|${collectionID} recognition records

The BOYD should not perform the remediation if the version is high enough. Newly initialized vats should start with the higher version (and never do remediation).

My fix for #9939 will probably add code to scanForDeadObjects that will check for any recognition records, and remove the entry if one is present. That clause won't have anything to do until after this remediation, but shouldn't hurt anything.

@warner
Copy link
Member Author

warner commented Aug 24, 2024

Oh that grep is overbroad, it will count Remotable keys in all collections, and we only actually care about the weak collections. Those might technically be different for each vat, vNN.vs.storeKindIDTables holds the mapping, but the current version of liveslots always allocates them at startup, and always gets the same values:

{"scalarMapStore":2,"scalarWeakMapStore":3,"scalarSetStore":4,"scalarWeakSetStore":5,"scalarDurableMapStore":6,"scalarDurableWeakMapStore":7,"scalarDurableSetStore":8,"scalarDurableWeakSetStore":9}

so for purposes of estimating remediation work, we could just look at 3/5/7/9.

warner added a commit that referenced this issue Aug 26, 2024
dropping a Remotable now checks for local recognizers, in preparation
for fixing #9956
warner added a commit that referenced this issue Aug 26, 2024
These tests will fail without the fixes in the next commit

One new test is disabled because of additional pending bugs that
interfere with the test setup (a combo of #9956 and #9959).
@warner warner self-assigned this Aug 26, 2024
@warner
Copy link
Member Author

warner commented Aug 26, 2024

The symptoms of this one overlap with #8756 : the key might get retained because deleting the collection didn't work, or the value might get retained because it's Remotable and the collection is weak and the key got retired and we don't have enough recognition records to know which collection to delete it from.

warner added a commit that referenced this issue Aug 28, 2024
Liveslots was suffering from the following bugs:

* vrefs weren't properly encoded into DB keys and "encodePassable"
  keys when deleting collections (such that `isEncodedRemotable()`
  always answered false)

* recognition records (vom.ir. keys) weren't created or removed for
  Remotable-style keys in weak collections

* Remotable-style vrefs were not submitted for retirement processing
  when deleting a weak collection

As a result:

* Invoking the `.clear()` method on a virtual/durable collection, or
  dereferencing the collection itself (and allowing it to be garbage
  collected), did not free any Remotables, virtual/durable
  objects (Representatives), or imported Presences used as keys. This
  could cause data in other weak collections (or other vats) to be
  retained longer than necessary.

* Allowing a weak virtual/durable collection to be garbage collected
  did not inform the kernel that the vat can no longer recognize the
  Remotable/Representative/Presence-style keys, consuming c-list
  entries longer than necessary.

* `collection.clear()` left the "ordinal-assignment"` records in the
  vatstore, causing subsequent .has() to incorrectly return true, and
  .init() to throw a "already registered" error, as well as consuming
  DB space longer than necessary.

* retiring (deleting) a Remotable used as a key in a weak
  virtual/durable collection did not free the corresponding value,
  causing data (in local and/or remote vats) to be retained longer
  than necessary

This commit fixes those bugs, which fixes the immediate cause of
issues #8756 , #7355 , and #9956 . As a change to liveslots, full
deployment requires both restarting the kernel with this new
code, *and* triggering a vat upgrade of all vats. Once that is done,
no new collection entries will suffer the problems listed above.

However, this commit makes no attempt to remediate any existing data
corruption. See #8759 for plans to build a tool that can audit the
virtual-data reference graph and detect the consequences of these bugs
in pre-existing kernel databases, and see the issues listed above for
notes on efforts to build remediation tools.
warner added a commit that referenced this issue Aug 28, 2024
Liveslots was suffering from the following bugs:

* vrefs weren't properly encoded into DB keys and "encodePassable"
  keys when deleting collections (such that `isEncodedRemotable()`
  always answered false, which skipped the refcount processing of
  keys, and removal of the "ordinal-assignment" record)

* recognition records (vom.ir. keys) weren't created or removed for
  Remotable-style keys in weak collections

* Presence-style vrefs were not submitted for retirement processing
  when deleting a weak collection

As a result:

* `collection.clear()` left the "ordinal-assignment"` records in the
  vatstore, causing subsequent .has(oldkey) to incorrectly return
  true, and .init(oldkey, newvalue) to throw a "already registered"
  error, as well as consuming DB space longer than necessary.

* Invoking the `.clear()` method on a virtual/durable collection, or
  dereferencing the collection itself (and allowing it to be garbage
  collected), did not free any Remotables, virtual/durable
  objects (Representatives), or imported Presences used as keys. This
  could cause data in other weak collections (or other vats) to be
  retained longer than necessary.

* retiring (deleting) a Remotable used as a key in a weak
  virtual/durable collection did not free the corresponding value,
  causing data (in local and/or remote vats) to be retained longer
  than necessary

* Allowing a weak virtual/durable collection to be garbage collected
  did not inform the kernel that the vat can no longer recognize the
  Presence-style keys, consuming c-list entries longer than necessary.

This commit fixes those bugs, which fixes the immediate cause of
issues #8756 , #7355 , and #9956 . As a change to liveslots, full
deployment requires both restarting the kernel with this new
code, *and* triggering a vat upgrade of all vats. Once that is done,
no new collection entries will suffer the problems listed above.

However, this commit makes no attempt to remediate any existing data
corruption. See #8759 for plans to build a tool that can audit the
virtual-data reference graph and detect the consequences of these bugs
in pre-existing kernel databases, and see the issues listed above for
notes on efforts to build remediation tools.
warner added a commit that referenced this issue Aug 29, 2024
Liveslots was suffering from the following bugs:

* vrefs weren't properly encoded into DB keys and "encodePassable"
  keys when deleting collections (such that `isEncodedRemotable()`
  always answered false, which skipped the refcount processing of
  keys, and removal of the "ordinal-assignment" record)

* recognition records (vom.ir. keys) weren't created or removed for
  Remotable-style keys in weak collections

* Presence-style vrefs were not submitted for retirement processing
  when deleting a weak collection

As a result:

* `collection.clear()` left the "ordinal-assignment"` records in the
  vatstore, causing subsequent .has(oldkey) to incorrectly return
  true, and .init(oldkey, newvalue) to throw a "already registered"
  error, as well as consuming DB space longer than necessary.

* Invoking the `.clear()` method on a virtual/durable collection, or
  dereferencing the collection itself (and allowing it to be garbage
  collected), did not free any Remotables, virtual/durable
  objects (Representatives), or imported Presences used as keys. This
  could cause data in other weak collections (or other vats) to be
  retained longer than necessary.

* retiring (deleting) a Remotable used as a key in a weak
  virtual/durable collection did not free the corresponding value,
  causing data (in local and/or remote vats) to be retained longer
  than necessary

* Allowing a weak virtual/durable collection to be garbage collected
  did not inform the kernel that the vat can no longer recognize the
  Presence-style keys, consuming c-list entries longer than necessary.

This commit fixes those bugs, which fixes the immediate cause of:

* fixes #8756
* fixes #7355
* fixes #9956

As a change to liveslots, full deployment requires both restarting the
kernel with this new code, *and* triggering a vat upgrade of all
vats. Once that is done, no new collection entries will suffer the
problems listed above.

However, this commit makes no attempt to remediate any existing data
corruption. See #8759 for plans to build a tool that can audit the
virtual-data reference graph and detect the consequences of these bugs
in pre-existing kernel databases, and see the issues listed above for
notes on efforts to build remediation tools.

This commit marks the new tests as expected to pass again. It adds one
new (failing) test to demonstrate the lack of remediation code.

Thanks to @gibson042 for recommendations.
warner added a commit that referenced this issue Aug 30, 2024
Liveslots was suffering from the following bugs:

* vrefs weren't properly encoded into DB keys and "encodePassable"
  keys when deleting collections (such that `isEncodedRemotable()`
  always answered false, which skipped the refcount processing of
  keys, and removal of the "ordinal-assignment" record)

* recognition records (vom.ir. keys) weren't created or removed for
  Remotable-style keys in weak collections

* Presence-style vrefs were not submitted for retirement processing
  when deleting a weak collection

As a result:

* `collection.clear()` left the "ordinal-assignment"` records in the
  vatstore, causing subsequent .has(oldkey) to incorrectly return
  true, and .init(oldkey, newvalue) to throw a "already registered"
  error, as well as consuming DB space longer than necessary.

* Invoking the `.clear()` method on a virtual/durable collection, or
  dereferencing the collection itself (and allowing it to be garbage
  collected), did not free any Remotables, virtual/durable
  objects (Representatives), or imported Presences used as keys. This
  could cause data in other weak collections (or other vats) to be
  retained longer than necessary.

* retiring (deleting) a Remotable used as a key in a weak
  virtual/durable collection did not free the corresponding value,
  causing data (in local and/or remote vats) to be retained longer
  than necessary

* Allowing a weak virtual/durable collection to be garbage collected
  did not inform the kernel that the vat can no longer recognize the
  Presence-style keys, consuming c-list entries longer than necessary.

This commit fixes those bugs, which fixes the immediate cause of:

* fixes #8756
* fixes #7355
* fixes #9956

As a change to liveslots, full deployment requires both restarting the
kernel with this new code, *and* triggering a vat upgrade of all
vats. Once that is done, no new collection entries will suffer the
problems listed above.

However, this commit makes no attempt to remediate any existing data
corruption. See #8759 for plans to build a tool that can audit the
virtual-data reference graph and detect the consequences of these bugs
in pre-existing kernel databases, and see the issues listed above for
notes on efforts to build remediation tools.

This commit marks the new tests as expected to pass again. It adds one
new (failing) test to demonstrate the lack of remediation code.

Thanks to @gibson042 for recommendations.
warner added a commit that referenced this issue Aug 30, 2024
Liveslots was suffering from the following bugs:

* vrefs weren't properly encoded into DB keys and "encodePassable"
  keys when deleting collections (such that `isEncodedRemotable()`
  always answered false, which skipped the refcount processing of
  keys, and removal of the "ordinal-assignment" record)

* recognition records (vom.ir. keys) weren't created or removed for
  Remotable-style keys in weak collections

* Presence-style vrefs were not submitted for retirement processing
  when deleting a weak collection

As a result:

* `collection.clear()` left the "ordinal-assignment"` records in the
  vatstore, causing subsequent .has(oldkey) to incorrectly return
  true, and .init(oldkey, newvalue) to throw a "already registered"
  error, as well as consuming DB space longer than necessary.

* Invoking the `.clear()` method on a virtual/durable collection, or
  dereferencing the collection itself (and allowing it to be garbage
  collected), did not free any Remotables, virtual/durable
  objects (Representatives), or imported Presences used as keys. This
  could cause data in other weak collections (or other vats) to be
  retained longer than necessary.

* retiring (deleting) a Remotable used as a key in a weak
  virtual/durable collection did not free the corresponding value,
  causing data (in local and/or remote vats) to be retained longer
  than necessary

* Allowing a weak virtual/durable collection to be garbage collected
  did not inform the kernel that the vat can no longer recognize the
  Presence-style keys, consuming c-list entries longer than necessary.

This commit fixes those bugs, which fixes the immediate cause of:

* fixes #8756
* fixes #7355
* fixes #9956

As a change to liveslots, full deployment requires both restarting the
kernel with this new code, *and* triggering a vat upgrade of all
vats. Once that is done, no new collection entries will suffer the
problems listed above.

However, this commit makes no attempt to remediate any existing data
corruption. See #8759 for plans to build a tool that can audit the
virtual-data reference graph and detect the consequences of these bugs
in pre-existing kernel databases, and see the issues listed above for
notes on efforts to build remediation tools.

This commit marks the new tests as expected to pass again. It adds one
new (failing) test to demonstrate the lack of remediation code.

Thanks to @gibson042 for recommendations.
mergify bot pushed a commit that referenced this issue Aug 30, 2024
These tests will fail without the fixes in the next commit

One new test is disabled because of additional pending bugs that
interfere with the test setup (a combo of #9956 and #9959), which will
be re-enabled in PR #9961 (to address #8756 and others)
warner added a commit that referenced this issue Aug 31, 2024
Liveslots was suffering from the following bugs:

* vrefs weren't properly encoded into DB keys and "encodePassable"
  keys when deleting collections (such that `isEncodedRemotable()`
  always answered false, which skipped the refcount processing of
  keys, and removal of the "ordinal-assignment" record)

* recognition records (vom.ir. keys) weren't created or removed for
  Remotable-style keys in weak collections

* Presence-style vrefs were not submitted for retirement processing
  when deleting a weak collection

As a result:

* `collection.clear()` left the "ordinal-assignment"` records in the
  vatstore, causing subsequent .has(oldkey) to incorrectly return
  true, and .init(oldkey, newvalue) to throw a "already registered"
  error, as well as consuming DB space longer than necessary.

* Invoking the `.clear()` method on a virtual/durable collection, or
  dereferencing the collection itself (and allowing it to be garbage
  collected), did not free any Remotables, virtual/durable
  objects (Representatives), or imported Presences used as keys. This
  could cause data in other weak collections (or other vats) to be
  retained longer than necessary.

* retiring (deleting) a Remotable used as a key in a weak
  virtual/durable collection did not free the corresponding value,
  causing data (in local and/or remote vats) to be retained longer
  than necessary

* Allowing a weak virtual/durable collection to be garbage collected
  did not inform the kernel that the vat can no longer recognize the
  Presence-style keys, consuming c-list entries longer than necessary.

This commit fixes those bugs, which fixes the immediate cause of:

* fixes #8756
* fixes #7355
* fixes #9956

As a change to liveslots, full deployment requires both restarting the
kernel with this new code, *and* triggering a vat upgrade of all
vats. Once that is done, no new collection entries will suffer the
problems listed above.

However, this commit makes no attempt to remediate any existing data
corruption. See #8759 for plans to build a tool that can audit the
virtual-data reference graph and detect the consequences of these bugs
in pre-existing kernel databases, and see the issues listed above for
notes on efforts to build remediation tools.

This commit marks the new tests as expected to pass again. It adds one
new (failing) test to demonstrate the lack of remediation code.

Thanks to @mhofman and @gibson042 for recommendations.
@mergify mergify bot closed this as completed in 97e81f1 Aug 31, 2024
mergify bot added a commit that referenced this issue Aug 31, 2024
Liveslots was suffering from the following bugs:

* vrefs weren't properly encoded into DB keys and "encodePassable"
  keys when deleting collections (such that `isEncodedRemotable()`
  always answered false, which skipped the refcount processing of
  keys, and removal of the "ordinal-assignment" record)

* recognition records (vom.ir. keys) weren't created or removed for
  Remotable-style keys in weak collections

* Presence-style vrefs were not submitted for retirement processing
  when deleting a weak collection

As a result:

* `collection.clear()` left the "ordinal-assignment"` records in the
  vatstore, causing subsequent .has(oldkey) to incorrectly return
  true, and .init(oldkey, newvalue) to throw a "already registered"
  error, as well as consuming DB space longer than necessary.

* Invoking the `.clear()` method on a virtual/durable collection, or
  dereferencing the collection itself (and allowing it to be garbage
  collected), did not free any Remotables, virtual/durable
  objects (Representatives), or imported Presences used as keys. This
  could cause data in other weak collections (or other vats) to be
  retained longer than necessary.

* retiring (deleting) a Remotable used as a key in a weak
  virtual/durable collection did not free the corresponding value,
  causing data (in local and/or remote vats) to be retained longer
  than necessary

* Allowing a weak virtual/durable collection to be garbage collected
  did not inform the kernel that the vat can no longer recognize the
  Presence-style keys, consuming c-list entries longer than necessary.

This commit fixes those bugs, which fixes the immediate cause of
issues #8756 , #7355 , and #9956 . As a change to liveslots, full
deployment requires both restarting the kernel with this new
code, *and* triggering a vat upgrade of all vats. Once that is done,
no new collection entries will suffer the problems listed above.

However, this commit makes no attempt to remediate any existing data
corruption. See #8759 for plans to build a tool that can audit the
virtual-data reference graph and detect the consequences of these bugs
in pre-existing kernel databases, and see the issues listed above for
notes on efforts to build remediation tools.
@warner
Copy link
Member Author

warner commented Aug 31, 2024

Impact measurement requires an audit of the vatstore, looking for two properties:

  • every vref key of a virtual/durable weak collection should either be reachable, or have an .ir recognition-record

Remediation will require that check, followed by either deleting the entry (for non-reachable keys) or adding the .ir record (for still-reachable ones).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working liveslots requires vat-upgrade to deploy changes SwingSet package: SwingSet
Projects
None yet
Development

No branches or pull requests

1 participant