-
Notifications
You must be signed in to change notification settings - Fork 215
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
build external refcount checking tool #8759
Comments
If we're gonna do this, could we not also do a full trace to detect orphans and unreferenced cycles? |
My immediate need is to tell whether #8756 needs remediation, or if we can simply deploy a fix and not worry about cleaning up old data, and that just need to look for refcount discrepancies. This tool would certainly be the basis for a full mark-and-sweep trace, but I'd consider it extra credit, and I'd like whoever implements this to answer the remediation question before doing the extra work. |
Liveslots was suffering from the following bugs: * vrefs weren't properly encoded into DB keys and "encodePassable" keys when deleting collections (such that `isEncodedRemotable()` always answered false) * recognition records (vom.ir. keys) weren't created or removed for Remotable-style keys in weak collections * Remotable-style vrefs were not submitted for retirement processing when deleting a weak collection As a result: * Invoking the `.clear()` method on a virtual/durable collection, or dereferencing the collection itself (and allowing it to be garbage collected), did not free any Remotables, virtual/durable objects (Representatives), or imported Presences used as keys. This could cause data in other weak collections (or other vats) to be retained longer than necessary. * Allowing a weak virtual/durable collection to be garbage collected did not inform the kernel that the vat can no longer recognize the Remotable/Representative/Presence-style keys, consuming c-list entries longer than necessary. * `collection.clear()` left the "ordinal-assignment"` records in the vatstore, causing subsequent .has() to incorrectly return true, and .init() to throw a "already registered" error, as well as consuming DB space longer than necessary. * retiring (deleting) a Remotable used as a key in a weak virtual/durable collection did not free the corresponding value, causing data (in local and/or remote vats) to be retained longer than necessary This commit fixes those bugs, which fixes the immediate cause of issues #8756 , #7355 , and #9956 . As a change to liveslots, full deployment requires both restarting the kernel with this new code, *and* triggering a vat upgrade of all vats. Once that is done, no new collection entries will suffer the problems listed above. However, this commit makes no attempt to remediate any existing data corruption. See #8759 for plans to build a tool that can audit the virtual-data reference graph and detect the consequences of these bugs in pre-existing kernel databases, and see the issues listed above for notes on efforts to build remediation tools.
Liveslots was suffering from the following bugs: * vrefs weren't properly encoded into DB keys and "encodePassable" keys when deleting collections (such that `isEncodedRemotable()` always answered false, which skipped the refcount processing of keys, and removal of the "ordinal-assignment" record) * recognition records (vom.ir. keys) weren't created or removed for Remotable-style keys in weak collections * Presence-style vrefs were not submitted for retirement processing when deleting a weak collection As a result: * `collection.clear()` left the "ordinal-assignment"` records in the vatstore, causing subsequent .has(oldkey) to incorrectly return true, and .init(oldkey, newvalue) to throw a "already registered" error, as well as consuming DB space longer than necessary. * Invoking the `.clear()` method on a virtual/durable collection, or dereferencing the collection itself (and allowing it to be garbage collected), did not free any Remotables, virtual/durable objects (Representatives), or imported Presences used as keys. This could cause data in other weak collections (or other vats) to be retained longer than necessary. * retiring (deleting) a Remotable used as a key in a weak virtual/durable collection did not free the corresponding value, causing data (in local and/or remote vats) to be retained longer than necessary * Allowing a weak virtual/durable collection to be garbage collected did not inform the kernel that the vat can no longer recognize the Presence-style keys, consuming c-list entries longer than necessary. This commit fixes those bugs, which fixes the immediate cause of issues #8756 , #7355 , and #9956 . As a change to liveslots, full deployment requires both restarting the kernel with this new code, *and* triggering a vat upgrade of all vats. Once that is done, no new collection entries will suffer the problems listed above. However, this commit makes no attempt to remediate any existing data corruption. See #8759 for plans to build a tool that can audit the virtual-data reference graph and detect the consequences of these bugs in pre-existing kernel databases, and see the issues listed above for notes on efforts to build remediation tools.
Liveslots was suffering from the following bugs: * vrefs weren't properly encoded into DB keys and "encodePassable" keys when deleting collections (such that `isEncodedRemotable()` always answered false, which skipped the refcount processing of keys, and removal of the "ordinal-assignment" record) * recognition records (vom.ir. keys) weren't created or removed for Remotable-style keys in weak collections * Presence-style vrefs were not submitted for retirement processing when deleting a weak collection As a result: * `collection.clear()` left the "ordinal-assignment"` records in the vatstore, causing subsequent .has(oldkey) to incorrectly return true, and .init(oldkey, newvalue) to throw a "already registered" error, as well as consuming DB space longer than necessary. * Invoking the `.clear()` method on a virtual/durable collection, or dereferencing the collection itself (and allowing it to be garbage collected), did not free any Remotables, virtual/durable objects (Representatives), or imported Presences used as keys. This could cause data in other weak collections (or other vats) to be retained longer than necessary. * retiring (deleting) a Remotable used as a key in a weak virtual/durable collection did not free the corresponding value, causing data (in local and/or remote vats) to be retained longer than necessary * Allowing a weak virtual/durable collection to be garbage collected did not inform the kernel that the vat can no longer recognize the Presence-style keys, consuming c-list entries longer than necessary. This commit fixes those bugs, which fixes the immediate cause of: * fixes #8756 * fixes #7355 * fixes #9956 As a change to liveslots, full deployment requires both restarting the kernel with this new code, *and* triggering a vat upgrade of all vats. Once that is done, no new collection entries will suffer the problems listed above. However, this commit makes no attempt to remediate any existing data corruption. See #8759 for plans to build a tool that can audit the virtual-data reference graph and detect the consequences of these bugs in pre-existing kernel databases, and see the issues listed above for notes on efforts to build remediation tools. This commit marks the new tests as expected to pass again. It adds one new (failing) test to demonstrate the lack of remediation code. Thanks to @gibson042 for recommendations.
Liveslots was suffering from the following bugs: * vrefs weren't properly encoded into DB keys and "encodePassable" keys when deleting collections (such that `isEncodedRemotable()` always answered false, which skipped the refcount processing of keys, and removal of the "ordinal-assignment" record) * recognition records (vom.ir. keys) weren't created or removed for Remotable-style keys in weak collections * Presence-style vrefs were not submitted for retirement processing when deleting a weak collection As a result: * `collection.clear()` left the "ordinal-assignment"` records in the vatstore, causing subsequent .has(oldkey) to incorrectly return true, and .init(oldkey, newvalue) to throw a "already registered" error, as well as consuming DB space longer than necessary. * Invoking the `.clear()` method on a virtual/durable collection, or dereferencing the collection itself (and allowing it to be garbage collected), did not free any Remotables, virtual/durable objects (Representatives), or imported Presences used as keys. This could cause data in other weak collections (or other vats) to be retained longer than necessary. * retiring (deleting) a Remotable used as a key in a weak virtual/durable collection did not free the corresponding value, causing data (in local and/or remote vats) to be retained longer than necessary * Allowing a weak virtual/durable collection to be garbage collected did not inform the kernel that the vat can no longer recognize the Presence-style keys, consuming c-list entries longer than necessary. This commit fixes those bugs, which fixes the immediate cause of: * fixes #8756 * fixes #7355 * fixes #9956 As a change to liveslots, full deployment requires both restarting the kernel with this new code, *and* triggering a vat upgrade of all vats. Once that is done, no new collection entries will suffer the problems listed above. However, this commit makes no attempt to remediate any existing data corruption. See #8759 for plans to build a tool that can audit the virtual-data reference graph and detect the consequences of these bugs in pre-existing kernel databases, and see the issues listed above for notes on efforts to build remediation tools. This commit marks the new tests as expected to pass again. It adds one new (failing) test to demonstrate the lack of remediation code. Thanks to @gibson042 for recommendations.
Liveslots was suffering from the following bugs: * vrefs weren't properly encoded into DB keys and "encodePassable" keys when deleting collections (such that `isEncodedRemotable()` always answered false, which skipped the refcount processing of keys, and removal of the "ordinal-assignment" record) * recognition records (vom.ir. keys) weren't created or removed for Remotable-style keys in weak collections * Presence-style vrefs were not submitted for retirement processing when deleting a weak collection As a result: * `collection.clear()` left the "ordinal-assignment"` records in the vatstore, causing subsequent .has(oldkey) to incorrectly return true, and .init(oldkey, newvalue) to throw a "already registered" error, as well as consuming DB space longer than necessary. * Invoking the `.clear()` method on a virtual/durable collection, or dereferencing the collection itself (and allowing it to be garbage collected), did not free any Remotables, virtual/durable objects (Representatives), or imported Presences used as keys. This could cause data in other weak collections (or other vats) to be retained longer than necessary. * retiring (deleting) a Remotable used as a key in a weak virtual/durable collection did not free the corresponding value, causing data (in local and/or remote vats) to be retained longer than necessary * Allowing a weak virtual/durable collection to be garbage collected did not inform the kernel that the vat can no longer recognize the Presence-style keys, consuming c-list entries longer than necessary. This commit fixes those bugs, which fixes the immediate cause of: * fixes #8756 * fixes #7355 * fixes #9956 As a change to liveslots, full deployment requires both restarting the kernel with this new code, *and* triggering a vat upgrade of all vats. Once that is done, no new collection entries will suffer the problems listed above. However, this commit makes no attempt to remediate any existing data corruption. See #8759 for plans to build a tool that can audit the virtual-data reference graph and detect the consequences of these bugs in pre-existing kernel databases, and see the issues listed above for notes on efforts to build remediation tools. This commit marks the new tests as expected to pass again. It adds one new (failing) test to demonstrate the lack of remediation code. Thanks to @gibson042 for recommendations.
Liveslots was suffering from the following bugs: * vrefs weren't properly encoded into DB keys and "encodePassable" keys when deleting collections (such that `isEncodedRemotable()` always answered false, which skipped the refcount processing of keys, and removal of the "ordinal-assignment" record) * recognition records (vom.ir. keys) weren't created or removed for Remotable-style keys in weak collections * Presence-style vrefs were not submitted for retirement processing when deleting a weak collection As a result: * `collection.clear()` left the "ordinal-assignment"` records in the vatstore, causing subsequent .has(oldkey) to incorrectly return true, and .init(oldkey, newvalue) to throw a "already registered" error, as well as consuming DB space longer than necessary. * Invoking the `.clear()` method on a virtual/durable collection, or dereferencing the collection itself (and allowing it to be garbage collected), did not free any Remotables, virtual/durable objects (Representatives), or imported Presences used as keys. This could cause data in other weak collections (or other vats) to be retained longer than necessary. * retiring (deleting) a Remotable used as a key in a weak virtual/durable collection did not free the corresponding value, causing data (in local and/or remote vats) to be retained longer than necessary * Allowing a weak virtual/durable collection to be garbage collected did not inform the kernel that the vat can no longer recognize the Presence-style keys, consuming c-list entries longer than necessary. This commit fixes those bugs, which fixes the immediate cause of: * fixes #8756 * fixes #7355 * fixes #9956 As a change to liveslots, full deployment requires both restarting the kernel with this new code, *and* triggering a vat upgrade of all vats. Once that is done, no new collection entries will suffer the problems listed above. However, this commit makes no attempt to remediate any existing data corruption. See #8759 for plans to build a tool that can audit the virtual-data reference graph and detect the consequences of these bugs in pre-existing kernel databases, and see the issues listed above for notes on efforts to build remediation tools. This commit marks the new tests as expected to pass again. It adds one new (failing) test to demonstrate the lack of remediation code. Thanks to @mhofman and @gibson042 for recommendations.
Liveslots was suffering from the following bugs: * vrefs weren't properly encoded into DB keys and "encodePassable" keys when deleting collections (such that `isEncodedRemotable()` always answered false, which skipped the refcount processing of keys, and removal of the "ordinal-assignment" record) * recognition records (vom.ir. keys) weren't created or removed for Remotable-style keys in weak collections * Presence-style vrefs were not submitted for retirement processing when deleting a weak collection As a result: * `collection.clear()` left the "ordinal-assignment"` records in the vatstore, causing subsequent .has(oldkey) to incorrectly return true, and .init(oldkey, newvalue) to throw a "already registered" error, as well as consuming DB space longer than necessary. * Invoking the `.clear()` method on a virtual/durable collection, or dereferencing the collection itself (and allowing it to be garbage collected), did not free any Remotables, virtual/durable objects (Representatives), or imported Presences used as keys. This could cause data in other weak collections (or other vats) to be retained longer than necessary. * retiring (deleting) a Remotable used as a key in a weak virtual/durable collection did not free the corresponding value, causing data (in local and/or remote vats) to be retained longer than necessary * Allowing a weak virtual/durable collection to be garbage collected did not inform the kernel that the vat can no longer recognize the Presence-style keys, consuming c-list entries longer than necessary. This commit fixes those bugs, which fixes the immediate cause of: * fixes #8756 * fixes #7355 * fixes #9956 As a change to liveslots, full deployment requires both restarting the kernel with this new code, *and* triggering a vat upgrade of all vats. Once that is done, no new collection entries will suffer the problems listed above. However, this commit makes no attempt to remediate any existing data corruption. See #8759 for plans to build a tool that can audit the virtual-data reference graph and detect the consequences of these bugs in pre-existing kernel databases, and see the issues listed above for notes on efforts to build remediation tools. This commit marks the new tests as expected to pass again. It adds one new (failing) test to demonstrate the lack of remediation code. Thanks to @mhofman and @gibson042 for recommendations.
Liveslots was suffering from the following bugs: * vrefs weren't properly encoded into DB keys and "encodePassable" keys when deleting collections (such that `isEncodedRemotable()` always answered false, which skipped the refcount processing of keys, and removal of the "ordinal-assignment" record) * recognition records (vom.ir. keys) weren't created or removed for Remotable-style keys in weak collections * Presence-style vrefs were not submitted for retirement processing when deleting a weak collection As a result: * `collection.clear()` left the "ordinal-assignment"` records in the vatstore, causing subsequent .has(oldkey) to incorrectly return true, and .init(oldkey, newvalue) to throw a "already registered" error, as well as consuming DB space longer than necessary. * Invoking the `.clear()` method on a virtual/durable collection, or dereferencing the collection itself (and allowing it to be garbage collected), did not free any Remotables, virtual/durable objects (Representatives), or imported Presences used as keys. This could cause data in other weak collections (or other vats) to be retained longer than necessary. * retiring (deleting) a Remotable used as a key in a weak virtual/durable collection did not free the corresponding value, causing data (in local and/or remote vats) to be retained longer than necessary * Allowing a weak virtual/durable collection to be garbage collected did not inform the kernel that the vat can no longer recognize the Presence-style keys, consuming c-list entries longer than necessary. This commit fixes those bugs, which fixes the immediate cause of issues #8756 , #7355 , and #9956 . As a change to liveslots, full deployment requires both restarting the kernel with this new code, *and* triggering a vat upgrade of all vats. Once that is done, no new collection entries will suffer the problems listed above. However, this commit makes no attempt to remediate any existing data corruption. See #8759 for plans to build a tool that can audit the virtual-data reference graph and detect the consequences of these bugs in pre-existing kernel databases, and see the issues listed above for notes on efforts to build remediation tools.
Talking with @siarhei-agoric , we decided that many modes of this tool do not really need a separate intermediate DB. For example, auditing for refcount consistency within a single vat can do:
That requires RAM in proportion to the number of vrefs in use by the vat, perhaps 1-10M for the larger ones, but does not require storing full details about each edge. If the tool detects a mismatch, we could re-run it in a verbose mode, with a specific set of vrefs to look for (repeating the full walk to gather more data), but with luck it will report "all refcounts match" and we don't need to spend the extra effort. |
As mentioned in #8756 :
The task for this ticket is to build that tool, and run it against the current mainnet database, to see if we have any problems. I suspect everything is fine, but it might show evidence of #8756, or of some other as-yet-discovered GC problem.
The tool should check both kernel and vat refcounts for consistency. The starting point should be a copy of the kvStore (just a new DB with
CREATE TABLE kv (key STRING, value STRING)
and a copy of the wholekvStore
), which will be a lot smaller and easier to work with than the full swingstore.Then I'd parse that data into a better-organized third DB. It should be parsed into a kernel-object table and a c-list table. Then the keys for each vatstore should be parsed and copied into tables for:
vom.${vref}
)vc.5.|schemata
)encodedKey
->marshalledValue
"forward" entries) (vc.5.${encodedKey}
)vc.5.|${vref}
)rc
entries)es
entries)ir
entries)Kernel Refcounts
The kvStore should be parsed into a kernel-object table, with columns for:
koid
(the object id kref)owner
: a vatID, or NULL for abandoned objectsreachable
andrecognizable
counts from the${kref}.refCount
entriesand also a c-list table:
o+
, 0 foro-
In general, the "reachable" refcount should equal the sum of the
exported=0 AND reachable=1
entries in the c-list table. The "recognizable" refcount is computed similarly. We ignore theexported=1
entries, because theirreachable
flag indicates whether the kernel can reach the exporting vat's object, so it doesn't count towards the refcount.Some objects might have a pinned refcount (via
controller.pinVatRoot()
), we might have elected to do this to the bootstrap vat's root object, but maybe not. Any other discrepancies might indicate a kernel GC problem.Vat Refcounts
In addition to parsing the kvStore into vatstore entries for each vat, and then parsing the vatstore entries into virtual-thing -specific tables, the tool should also scan the
vom
virtual-object state records for vrefs, and populate a table that says e.g. "virtual object o+d10/12 has a strong reference to vref o-3". This table would also have rows for collections which point to other vrefs, maybe also through a third table that captures both the collection's vref and the vref of the key (the implication being that both source vrefs must be alive to keep the target vref alive). And some variant of this for weak collections would be good.Then, the tool should add up the inbound strong references from both virtual-object state and collections, and compare that count against the
rc
entry. If they don't match, we've got a bug (perhaps #8756, perhaps a new one).It should also compare the
es
export status with the matching c-list entry for that vat. Ifes
claims the exported vref is reachable, the c-list should sayR
. Ifes
says merely-recognizable, it should say_
.Finally, it should check the recognizability records for consistency. Vrefs that are used as keys in weak collections will appear both in the collection entry rows (as an ordinal), and there will be an
ir
record mapping from the vref to the collection ID. If we see one without the other, we've got some other kind of bug.The text was updated successfully, but these errors were encountered: