-
Notifications
You must be signed in to change notification settings - Fork 215
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
collection.clear() fails to decref objects used as keys, or delete the reverse-mapping vatstore entries #8756
Comments
We could check the need for remediation by building an external refcount checker: a tool that walks the DB and looks at each vat's |
Why does the version accepting a pattern need to construct representatives? Is it because the pattern may require of a specific instance instead of a generic "remotable"? I'm wondering if we could optimize the clear with pattern to avoid creating representative for each keys as that would cause GC pressure (representatives are only collected at the next BOYD)
This effectively puts a cap on the number of object references in a collection ever (not at a given point in time, but over time), right? The value might be large enough, but there are ways to encode that keep sorting, and increase the size as needed. cc @gibson042 who's a specialist in these encoding tricks.
This seems to lack the reference counting syscalls made when inserting the object in the collection?
One thing that concerns me with this whole scheme is that it leaves no space for composite keys. This will effectively allow the same vref in multiple collection entries. Either we re-use the remotable reference create for it (unlikely) and need to count how many times that remotable reference is used, or we need to accept that a single vref may have multiple object references in the collection. I suppose the latter could be done by transforming the reverse mapping into a list of ordinals. |
Fix collection deletion. Previously, objects (passables/vrefs) used as collection keys were not de-referenced when the collection was dropped/deleted or unconditionally cleared (`collection.clear()` without keyPatt or valuePatt). Strong collections would leak a reachable-refcount to the key object, and weak collections would leak a recognizable-refcount, leading to the object being kept alive (or kept unretired) forever. In addition, each entry would leak a vatstore key until the collection was finally deleted. The code in `clearInternalFull()` confused dbKeys, encodedKeys, and vrefs, and gave the wrong kind of key to `isEncodedRemotable()`, so the answer was always "no". fixes #8756
Fix collection deletion. Previously, objects (passables/vrefs) used as collection keys were not de-referenced when the collection was dropped/deleted or unconditionally cleared (`collection.clear()` without keyPatt or valuePatt). Strong collections would leak a reachable-refcount to the key object, and weak collections would leak a recognizable-refcount, leading to the object being kept alive (or kept unretired) forever. In addition, each entry would leak a vatstore key until the collection was finally deleted. The code in `clearInternalFull()` confused dbKeys, encodedKeys, and vrefs, and gave the wrong kind of key to `isEncodedRemotable()`, so the answer was always "no". fixes #8756
New notes from my extensive-but-redundant writeup in #9959 : The deeper improvements are:
MeasurementThe walkthrough above shows that We could/should also build a tool to scan the whole vatstore and extract the durable reference graph, then compare it against the refcounts to see if they match. Entries which have been mishandled this way will have had their RemediationMessy but it could be worse. We could take advantage of the leftover ordinal records to get a list of the problematic vrefs, decref them, and delete the ordinal record. We'd do this with a vatstore version bump, so we know if/when it has been done, to avoid doing the remediation more than once. |
Fix collection deletion. Previously, objects (passables/vrefs) used as collection keys were not de-referenced when the collection was dropped/deleted or unconditionally cleared (`collection.clear()` without keyPatt or valuePatt). Strong collections would leak a reachable-refcount to the key object, and weak collections would leak a recognizable-refcount, leading to the object being kept alive (or kept unretired) forever. In addition, each entry would leak a vatstore key until the collection was finally deleted. The code in `clearInternalFull()` confused dbKeys, encodedKeys, and vrefs, and gave the wrong kind of key to `isEncodedRemotable()`, so the answer was always "no". fixes #8756 move test, new naming convention more test, needs refactor refactor test finish weak-collection test fix: retire objects from weakstore recognizers add test of voAwareWeakSet deletion and retirement simplify add/removeRecognizableVref These functions have two purposes: * track "recognition records" for keys of weak collections * tell BOYD about Presence vrefs (addToPossiblyRetiredSet) when a record is removed Virtual/durable weak stores keep their recognition records in vatstore `vom.ir.${vref}|${collectionID}` keys. voAwareWeakMap/Set keeps them in RAM, in a Map named vrefRecognizers, where the key is the vref being recognized, and the value is a Set of virtualObjectMaps (one per recognizing WeakMap/Set). Adding a record means adding a `vom.ir.` key, or adding to the Set (which might require adding a Map entry). Removing a record means removing a `vom.ir.` key, or removing from the Set (and maybe the Map entry). We need to add the vref to possiblyRetiredSet when both 1: it is a Presence-style vref (o-NN), and 2: it might be the last recognizer. For `vom.ir.` keys, we do this for every Presence-style vref, whether or not we're the last one, because searching for other records would be expensive. For voAwareWeakMap/Set, we have to delete the Map entry if the Set became empty anyways, so we only addToPossiblyRetiredSet if it was the last one. The previous version only updated records for Presence-style and Representative-style (virtual/durable object) vrefs, and avoided updating records for Remotable-style vrefs. This changes the logic to update records for all vrefs, so that retiring a key can remove the weakstore entry (and its associated data, for WeakMapStores). This causes two behavioral changes: * add vom.ir. keys for Remotable vrefs * only addToPossiblyRetiredSet for presence-style vrefs in weakstores make virtualReferences right more more more
Liveslots was suffering from the following bugs: * vrefs weren't properly encoded into DB keys and "encodePassable" keys when deleting collections (such that `isEncodedRemotable()` always answered false) * recognition records (vom.ir. keys) weren't created or removed for Remotable-style keys in weak collections * Remotable-style vrefs were not submitted for retirement processing when deleting a weak collection As a result: * Invoking the `.clear()` method on a virtual/durable collection, or dereferencing the collection itself (and allowing it to be garbage collected), did not free any Remotables, virtual/durable objects (Representatives), or imported Presences used as keys. This could cause data in other weak collections (or other vats) to be retained longer than necessary. * Allowing a weak virtual/durable collection to be garbage collected did not inform the kernel that the vat can no longer recognize the Remotable/Representative/Presence-style keys, consuming c-list entries longer than necessary. * `collection.clear()` left the "ordinal-assignment"` records in the vatstore, causing subsequent .has() to incorrectly return true, and .init() to throw a "already registered" error, as well as consuming DB space longer than necessary. * retiring (deleting) a Remotable used as a key in a weak virtual/durable collection did not free the corresponding value, causing data (in local and/or remote vats) to be retained longer than necessary This commit fixes those bugs, which fixes the immediate cause of issues #8756 , #7355 , and #9956 . As a change to liveslots, full deployment requires both restarting the kernel with this new code, *and* triggering a vat upgrade of all vats. Once that is done, no new collection entries will suffer the problems listed above. However, this commit makes no attempt to remediate any existing data corruption. See #8759 for plans to build a tool that can audit the virtual-data reference graph and detect the consequences of these bugs in pre-existing kernel databases, and see the issues listed above for notes on efforts to build remediation tools.
Liveslots was suffering from the following bugs: * vrefs weren't properly encoded into DB keys and "encodePassable" keys when deleting collections (such that `isEncodedRemotable()` always answered false, which skipped the refcount processing of keys, and removal of the "ordinal-assignment" record) * recognition records (vom.ir. keys) weren't created or removed for Remotable-style keys in weak collections * Presence-style vrefs were not submitted for retirement processing when deleting a weak collection As a result: * `collection.clear()` left the "ordinal-assignment"` records in the vatstore, causing subsequent .has(oldkey) to incorrectly return true, and .init(oldkey, newvalue) to throw a "already registered" error, as well as consuming DB space longer than necessary. * Invoking the `.clear()` method on a virtual/durable collection, or dereferencing the collection itself (and allowing it to be garbage collected), did not free any Remotables, virtual/durable objects (Representatives), or imported Presences used as keys. This could cause data in other weak collections (or other vats) to be retained longer than necessary. * retiring (deleting) a Remotable used as a key in a weak virtual/durable collection did not free the corresponding value, causing data (in local and/or remote vats) to be retained longer than necessary * Allowing a weak virtual/durable collection to be garbage collected did not inform the kernel that the vat can no longer recognize the Presence-style keys, consuming c-list entries longer than necessary. This commit fixes those bugs, which fixes the immediate cause of issues #8756 , #7355 , and #9956 . As a change to liveslots, full deployment requires both restarting the kernel with this new code, *and* triggering a vat upgrade of all vats. Once that is done, no new collection entries will suffer the problems listed above. However, this commit makes no attempt to remediate any existing data corruption. See #8759 for plans to build a tool that can audit the virtual-data reference graph and detect the consequences of these bugs in pre-existing kernel databases, and see the issues listed above for notes on efforts to build remediation tools.
PR #9961 will fix this. Calls to As a change to liveslots, this requires two steps to fix any given (deployed) vat. First, the kernel must be restarted with a version of Note that any collections which were The #8759 tool will check for this kind of corruption, so if you've got a DB and are worried about it, run that tool against your DB (once it exists). You could manually remediate this by deleting the ordinal-assignment records that don't match, which would fix the correctness-breaking corruption (even though there might still be refcount corruption, which causes objects to be retained incorrectly, but that's just a space leak). I'll close this once the PR lands. |
This was broken because of #8756 . The modified tests are marked as failing, until the fix is landed.
Liveslots was suffering from the following bugs: * vrefs weren't properly encoded into DB keys and "encodePassable" keys when deleting collections (such that `isEncodedRemotable()` always answered false, which skipped the refcount processing of keys, and removal of the "ordinal-assignment" record) * recognition records (vom.ir. keys) weren't created or removed for Remotable-style keys in weak collections * Presence-style vrefs were not submitted for retirement processing when deleting a weak collection As a result: * `collection.clear()` left the "ordinal-assignment"` records in the vatstore, causing subsequent .has(oldkey) to incorrectly return true, and .init(oldkey, newvalue) to throw a "already registered" error, as well as consuming DB space longer than necessary. * Invoking the `.clear()` method on a virtual/durable collection, or dereferencing the collection itself (and allowing it to be garbage collected), did not free any Remotables, virtual/durable objects (Representatives), or imported Presences used as keys. This could cause data in other weak collections (or other vats) to be retained longer than necessary. * retiring (deleting) a Remotable used as a key in a weak virtual/durable collection did not free the corresponding value, causing data (in local and/or remote vats) to be retained longer than necessary * Allowing a weak virtual/durable collection to be garbage collected did not inform the kernel that the vat can no longer recognize the Presence-style keys, consuming c-list entries longer than necessary. This commit fixes those bugs, which fixes the immediate cause of: * fixes #8756 * fixes #7355 * fixes #9956 As a change to liveslots, full deployment requires both restarting the kernel with this new code, *and* triggering a vat upgrade of all vats. Once that is done, no new collection entries will suffer the problems listed above. However, this commit makes no attempt to remediate any existing data corruption. See #8759 for plans to build a tool that can audit the virtual-data reference graph and detect the consequences of these bugs in pre-existing kernel databases, and see the issues listed above for notes on efforts to build remediation tools. This commit marks the new tests as expected to pass again. It adds one new (failing) test to demonstrate the lack of remediation code. Thanks to @gibson042 for recommendations.
This was broken because of #8756 . The modified tests are marked as failing, until the fix is landed.
Liveslots was suffering from the following bugs: * vrefs weren't properly encoded into DB keys and "encodePassable" keys when deleting collections (such that `isEncodedRemotable()` always answered false, which skipped the refcount processing of keys, and removal of the "ordinal-assignment" record) * recognition records (vom.ir. keys) weren't created or removed for Remotable-style keys in weak collections * Presence-style vrefs were not submitted for retirement processing when deleting a weak collection As a result: * `collection.clear()` left the "ordinal-assignment"` records in the vatstore, causing subsequent .has(oldkey) to incorrectly return true, and .init(oldkey, newvalue) to throw a "already registered" error, as well as consuming DB space longer than necessary. * Invoking the `.clear()` method on a virtual/durable collection, or dereferencing the collection itself (and allowing it to be garbage collected), did not free any Remotables, virtual/durable objects (Representatives), or imported Presences used as keys. This could cause data in other weak collections (or other vats) to be retained longer than necessary. * retiring (deleting) a Remotable used as a key in a weak virtual/durable collection did not free the corresponding value, causing data (in local and/or remote vats) to be retained longer than necessary * Allowing a weak virtual/durable collection to be garbage collected did not inform the kernel that the vat can no longer recognize the Presence-style keys, consuming c-list entries longer than necessary. This commit fixes those bugs, which fixes the immediate cause of: * fixes #8756 * fixes #7355 * fixes #9956 As a change to liveslots, full deployment requires both restarting the kernel with this new code, *and* triggering a vat upgrade of all vats. Once that is done, no new collection entries will suffer the problems listed above. However, this commit makes no attempt to remediate any existing data corruption. See #8759 for plans to build a tool that can audit the virtual-data reference graph and detect the consequences of these bugs in pre-existing kernel databases, and see the issues listed above for notes on efforts to build remediation tools. This commit marks the new tests as expected to pass again. It adds one new (failing) test to demonstrate the lack of remediation code. Thanks to @gibson042 for recommendations.
This was broken because of #8756 . The modified tests are marked as failing, until the fix is landed.
Liveslots was suffering from the following bugs: * vrefs weren't properly encoded into DB keys and "encodePassable" keys when deleting collections (such that `isEncodedRemotable()` always answered false, which skipped the refcount processing of keys, and removal of the "ordinal-assignment" record) * recognition records (vom.ir. keys) weren't created or removed for Remotable-style keys in weak collections * Presence-style vrefs were not submitted for retirement processing when deleting a weak collection As a result: * `collection.clear()` left the "ordinal-assignment"` records in the vatstore, causing subsequent .has(oldkey) to incorrectly return true, and .init(oldkey, newvalue) to throw a "already registered" error, as well as consuming DB space longer than necessary. * Invoking the `.clear()` method on a virtual/durable collection, or dereferencing the collection itself (and allowing it to be garbage collected), did not free any Remotables, virtual/durable objects (Representatives), or imported Presences used as keys. This could cause data in other weak collections (or other vats) to be retained longer than necessary. * retiring (deleting) a Remotable used as a key in a weak virtual/durable collection did not free the corresponding value, causing data (in local and/or remote vats) to be retained longer than necessary * Allowing a weak virtual/durable collection to be garbage collected did not inform the kernel that the vat can no longer recognize the Presence-style keys, consuming c-list entries longer than necessary. This commit fixes those bugs, which fixes the immediate cause of: * fixes #8756 * fixes #7355 * fixes #9956 As a change to liveslots, full deployment requires both restarting the kernel with this new code, *and* triggering a vat upgrade of all vats. Once that is done, no new collection entries will suffer the problems listed above. However, this commit makes no attempt to remediate any existing data corruption. See #8759 for plans to build a tool that can audit the virtual-data reference graph and detect the consequences of these bugs in pre-existing kernel databases, and see the issues listed above for notes on efforts to build remediation tools. This commit marks the new tests as expected to pass again. It adds one new (failing) test to demonstrate the lack of remediation code. Thanks to @gibson042 for recommendations.
This was broken because of #8756 . The modified tests are marked as failing, until the fix is landed.
Liveslots was suffering from the following bugs: * vrefs weren't properly encoded into DB keys and "encodePassable" keys when deleting collections (such that `isEncodedRemotable()` always answered false, which skipped the refcount processing of keys, and removal of the "ordinal-assignment" record) * recognition records (vom.ir. keys) weren't created or removed for Remotable-style keys in weak collections * Presence-style vrefs were not submitted for retirement processing when deleting a weak collection As a result: * `collection.clear()` left the "ordinal-assignment"` records in the vatstore, causing subsequent .has(oldkey) to incorrectly return true, and .init(oldkey, newvalue) to throw a "already registered" error, as well as consuming DB space longer than necessary. * Invoking the `.clear()` method on a virtual/durable collection, or dereferencing the collection itself (and allowing it to be garbage collected), did not free any Remotables, virtual/durable objects (Representatives), or imported Presences used as keys. This could cause data in other weak collections (or other vats) to be retained longer than necessary. * retiring (deleting) a Remotable used as a key in a weak virtual/durable collection did not free the corresponding value, causing data (in local and/or remote vats) to be retained longer than necessary * Allowing a weak virtual/durable collection to be garbage collected did not inform the kernel that the vat can no longer recognize the Presence-style keys, consuming c-list entries longer than necessary. This commit fixes those bugs, which fixes the immediate cause of: * fixes #8756 * fixes #7355 * fixes #9956 As a change to liveslots, full deployment requires both restarting the kernel with this new code, *and* triggering a vat upgrade of all vats. Once that is done, no new collection entries will suffer the problems listed above. However, this commit makes no attempt to remediate any existing data corruption. See #8759 for plans to build a tool that can audit the virtual-data reference graph and detect the consequences of these bugs in pre-existing kernel databases, and see the issues listed above for notes on efforts to build remediation tools. This commit marks the new tests as expected to pass again. It adds one new (failing) test to demonstrate the lack of remediation code. Thanks to @mhofman and @gibson042 for recommendations.
This was broken because of #8756 . The modified tests are marked as failing, until the fix is landed.
Liveslots was suffering from the following bugs: * vrefs weren't properly encoded into DB keys and "encodePassable" keys when deleting collections (such that `isEncodedRemotable()` always answered false, which skipped the refcount processing of keys, and removal of the "ordinal-assignment" record) * recognition records (vom.ir. keys) weren't created or removed for Remotable-style keys in weak collections * Presence-style vrefs were not submitted for retirement processing when deleting a weak collection As a result: * `collection.clear()` left the "ordinal-assignment"` records in the vatstore, causing subsequent .has(oldkey) to incorrectly return true, and .init(oldkey, newvalue) to throw a "already registered" error, as well as consuming DB space longer than necessary. * Invoking the `.clear()` method on a virtual/durable collection, or dereferencing the collection itself (and allowing it to be garbage collected), did not free any Remotables, virtual/durable objects (Representatives), or imported Presences used as keys. This could cause data in other weak collections (or other vats) to be retained longer than necessary. * retiring (deleting) a Remotable used as a key in a weak virtual/durable collection did not free the corresponding value, causing data (in local and/or remote vats) to be retained longer than necessary * Allowing a weak virtual/durable collection to be garbage collected did not inform the kernel that the vat can no longer recognize the Presence-style keys, consuming c-list entries longer than necessary. This commit fixes those bugs, which fixes the immediate cause of issues #8756 , #7355 , and #9956 . As a change to liveslots, full deployment requires both restarting the kernel with this new code, *and* triggering a vat upgrade of all vats. Once that is done, no new collection entries will suffer the problems listed above. However, this commit makes no attempt to remediate any existing data corruption. See #8759 for plans to build a tool that can audit the virtual-data reference graph and detect the consequences of these bugs in pre-existing kernel databases, and see the issues listed above for notes on efforts to build remediation tools.
Yep, exactly.
I guess, if we made some sort of "virtual pattern" or "virtual matcher" that could remember its vref without actually holding a Representative.
Yeah, this week we talked about his "unary length encoding" scheme that would remove that arbitrary limit.
I omitted that from the writeup. The actual if (passStyleOf(key) === 'remotable') {
...
if (hasWeakKeys) {
vrm.addRecognizableValue(key, `${collectionID}`, true);
} else {
vrm.addReachableVref(vref);
}
Yeah, I don't know how composite keys are going to work. Maybe we assign an ordinal to each composite key, have a secondary table that tracks the relationship between ordinals and the vrefs that go into them (and vice versa, probably many-to-many). It will probably become easier if/when we can take better advantage of SQLite, and use real column indexes. |
While working on rate-limited GC, I discovered that calling
.clear()
method on a ScalarBigMapStore or ScalarBigSetStore would fail to delete any vref-bearing objects used as keys of the collection.Background: How Virtual Collections work
Our
.clear()
method is fancy: it accepts a "key pattern" and "value pattern" to limit the scope of what gets cleared. When these are used,clear()
must iterate through each entry, compare the key/value against the pattern, and only delete the ones that match.When we omit those patterns, or when we're deleting the collection entirely (because nobody is referencing it anymore), we use a specialized function named
clearInternalFull()
, which avoids deserializing the keys/values first. This is faster: we need to know what the keys/values point to (so we can decrement their refcounts), but we don't need to construct any new Representatives.Collections are encoded as one vatstore record per collection entry. The vatstore key is an encoded form of the collection key, and the vatstore value is the JSON-serialized marshalling of the collection value. We use the
encodePassable
from@endo/marshal
to encode the collection key into a string, which can then be further encoded into a vatstore key.The collection key might be an object, with a pass-by-reference identity. These objects all have vrefs: values like
o-12
for the Presence you get when importing an object from the kernel,o+13
for the ephemeral Remotable you create withFar()
and might export into the kernel, oro+d14/2
for a Representative of one instance of a virtual/durable Kind that might also be exported into the kernel.We need virtual collections to enumerate in a useful and stable order, and we use the vatstore ordering directly as the collection's ordering. We rely upon
encodePassable
to provide encoded keys that maintain this order. For example, all BigInts are encoded into a form that has a length prefix on the MSB side, so that19
sorts later than2
, and it uses an
orp
prefix for negative and positive numbers to ensure15
sorts later than-15
.When
encodePassable
is given a pass-by-reference object (with a vref), we cannot use the vref as the sort order. Vrefs are local to the vat, so different vats would observe different orderings. In addition, we keep vrefs private to prevent an information leak (you could learn how many objects have been allocated, which would enable both eavesdropping and a covert communication channel).Instead, we assign each object-used-as-key an "ordinal" integer. This gives us insertion order, just like JavaScript's native Map and Set collections. For as long as the object is a key of that collection, it will retain the same ordinal. We use a second vatstore record to map from the vref to its ordinal (the key is prefixed with
|
, which is not used byencodePassable
). And the main entry's key is a combination (safe concatenation) of the ordinal plus the vref, so we don't need a third entry to remember the ordinal->vref mapping.Suppose
collection
is collection number 5 (whose data all uses a vatstore prefix ofvc.5.
), and userspace defines anadd
method like:If
add(p)
is invoked with a Presence whose vref iso-12
:1
too-12
(if it is the first one added)encodePassable
tells us to use anencodedKey
ofr0000000001:o+v10/1
r
prefix (marking this key type as an object, rather than a string or Number or BigInt, etc), the fixed-length ordinal (providing insertion order), and the original vref (for recoverability during.keys()
or.entries()
)dbKey
ofvc.5.r0000000001:o+v10/1
'data'
into something like{ body: '"data"', slots: [] }
vatstoreSet('vc.5.r0000000001:o+v10/1', data)
vatstoreSet('vc.5.|o+v10/1', '1')
, for the reverse recordHow clearInternalFull Works
Our
clearInternalFull
uses a loop like this:We iterate through the DB with
enumerateKeysStartEnd
to find all the forward-mapping keys (likevc.5.r0000000001:o+v10/1
). For each one, we fetch its value, and run all the value's slots throughvrm.removeReachableVref()
, to decrement refcounts on anything pointed to by the value. Then we delete the forward mapping withvatstoreDelete
.Then we use
isEncodedRemotable(dbKey)
to decide whether the key was a pass-by-reference value. If so, we usevrefFromDBKey
to convertvc.5.r0000000001:o+v10/1
intoo+v10/1
, and decrement either its reference count or its recognizability count (depending upon whether the collection was strong or weak). Finally, we delete the reverse mapping with anothervatstoreDelete
.The Bug
The problem is that
isEncodedRemotable(dbKey)
is being passedvc.5.r0000000001:o+v10/1
, but is expectingr0000000001
. This function is coming from the encode-passble library of@endo/marshal
, and doesn't know anything about our database keys. It will always answerfalse
, because liveslots is always using dbKeys that start with the virtual-collection prefix (vc.
something), and the only encode-passable prefix that is used for "encoded remotables" isr
.As a result,
clearInternalFull()
will never perform the key-deletion code for passables used as keys, with two consequences:The vatstore entry will be removed if/when the collection is deleted (not merely cleared), because there is extra code there to delete all the
vc.5.|
-prefixed metadata keys, and that happens to include the reverse-mapping keys. But the only other way to delete that DB space (or to drop the refcounts of the key objects) is to delete the entire vat.The
if
clause being skipped had a second bug: it callsvrefFromDBKey(dbKey)
to get the vref to use, but that function is:The
dbKey
actually starts with the collection prefix, likevc.5.
, but this function is only given the job of stripping off ther0000000001:
encode-passable tag. So it will return corrupted data.The only other caller of
vrefFromDBKey
is in the very next function:where it does get a prefix-stripped value. The core problem is naming: we need separate functions, with correct names and argument types:
Incidentally, we already have an
encodedKeyFromDBKey
, it is nameddbKeyToKey
:The Fix
vrefFromEncodedKey
and fixvrefFromDBKey
isEncodedRemotable(dbKey)
to beisEncodedRemotable(dbKeyToKey(dbKey))
To deploy the fix into any given vat, we must perform a liveslots-bumping vat upgrade.
Remediation
Fortunately, our production code does not delete or clear entire collections very frequently.
Unfortunately, it is not easy to search the code for places that might delete a collection. We can grep for
clear()
, or calls toclear(keyPatt, valuePatt)
(and filter on cases where both patterns are empty). But that won't reveal places where a collection is dropped and then gets deleted when liveslots notices all its pillars have fallen.We can scan the slogfiles for places where two encoded-passable keys are deleted in quick succession: eyeballing the vatstore calls should make it pretty obvious if we're looking at a
clearInternalFull()
.I think it's entirely possible that we haven't run into this problem in production yet, which would be lovely.
But if we do have lost keys, we'll need to come up with a remediation plan. Maybe we should change
deleteCollection
to look for the reverse-mapping keys as it does it's final metadata sweep, and instead of merely deleting the key, it should also decrement it's reachability/recognizability count. If the entry was deleted properly (by the non-full
form ofclearInternal
), the refcount was changed, and the reverse-mapping key was deleted. So the only case where the key is still present should also mean the refcount was not changed.Then the remediation process is to somehow get userspace to delete the affected collections.
We might also change
clear()
to scan for leftover reverse-mapping keys, after it's processed and deleted everything else, and convince userspace to merely.clear()
the affected collections.These would only work if the collection were still reachable by userspace. A coding pattern that re-uses the same collection but periodically clears it would probably fix itself the next time it gets cleared. But if the collection is deleted by GC, then we've lost the data that shows where refcounts came from, and remediation will have to wait for liveslots to acquire a full mark-and-sweep garbage collector (which would discover the discrepancy between the refcount and the set of references, and could correctly mark the object for deletion).
And of course, we must upgrade the vats in question to a new version of liveslots, before any of these remediations could be applied, just to get the new code into place.
Deeper Fixes
Code-coverage would probably have spotted this: I don't think there's any way we could reach that
if
clause, and we certainly have tests that use objects as keys. So improving our coverage infrastructure would be a big help.I'm writing more tests that examine the syscalls triggered by deletion, which is how I discovered this one. Maybe more tests of that flavor.
Stronger types might have also helped: the core confusions are that
isEncodedRemotable
takes anencodedKey
but is given adbKey
, and thatvrefFromDBKey
is written to expect anencodedKey
but its name and signature claim that it takes adbKey
. However, since these are all strings, I don't think the "structural types" that TypeScript offers would be sufficient to catch the bug; it would need "nominal types" to distinguish between thedbKey
kind of string, theencodedKey
kind of string, and thevref
kind of string.The text was updated successfully, but these errors were encountered: