Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

build external refcount checking tool #8759

Open
warner opened this issue Jan 17, 2024 · 3 comments
Open

build external refcount checking tool #8759

warner opened this issue Jan 17, 2024 · 3 comments
Assignees
Labels
liveslots requires vat-upgrade to deploy changes SwingSet package: SwingSet

Comments

@warner
Copy link
Member

warner commented Jan 17, 2024

As mentioned in #8756 :

We could check the need for remediation by building an external refcount checker: a tool that walks the DB and looks at each vat's vatstore, finds references from virtual data, adds them up, and compares against the rc count in the DB. We can do the same for ir recognizer records and the keys of weak virtual collections. This bug would manifest as a difference between the two numbers.

The task for this ticket is to build that tool, and run it against the current mainnet database, to see if we have any problems. I suspect everything is fine, but it might show evidence of #8756, or of some other as-yet-discovered GC problem.

The tool should check both kernel and vat refcounts for consistency. The starting point should be a copy of the kvStore (just a new DB with CREATE TABLE kv (key STRING, value STRING) and a copy of the whole kvStore), which will be a lot smaller and easier to work with than the full swingstore.

Then I'd parse that data into a better-organized third DB. It should be parsed into a kernel-object table and a c-list table. Then the keys for each vatstore should be parsed and copied into tables for:

  • virtual-object Kind metadata
  • virtual-object state data (vom.${vref})
  • virtual-collection metadata (e.g. vc.5.|schemata)
  • virtual-collection entries (the encodedKey -> marshalledValue "forward" entries) (vc.5.${encodedKey})
  • virtual-collection ordinal entries (the vref -> ordinal "reverse" entries) (vc.5.|${vref})
  • refcounts (the rc entries)
  • export-statuses (es entries)
  • recognizer entries (ir entries)

Kernel Refcounts

The kvStore should be parsed into a kernel-object table, with columns for:

  • koid (the object id kref)
  • owner: a vatID, or NULL for abandoned objects
  • reachable and recognizable counts from the ${kref}.refCount entries

and also a c-list table:

  • vatID
  • kref
  • vref
  • exported: 1 for c-list entries whose vref is o+, 0 for o-
  • reachable flag (boolean)
  • recognizable flag (boolean)

In general, the "reachable" refcount should equal the sum of the exported=0 AND reachable=1 entries in the c-list table. The "recognizable" refcount is computed similarly. We ignore the exported=1 entries, because their reachable flag indicates whether the kernel can reach the exporting vat's object, so it doesn't count towards the refcount.

Some objects might have a pinned refcount (via controller.pinVatRoot()), we might have elected to do this to the bootstrap vat's root object, but maybe not. Any other discrepancies might indicate a kernel GC problem.

Vat Refcounts

In addition to parsing the kvStore into vatstore entries for each vat, and then parsing the vatstore entries into virtual-thing -specific tables, the tool should also scan the vom virtual-object state records for vrefs, and populate a table that says e.g. "virtual object o+d10/12 has a strong reference to vref o-3". This table would also have rows for collections which point to other vrefs, maybe also through a third table that captures both the collection's vref and the vref of the key (the implication being that both source vrefs must be alive to keep the target vref alive). And some variant of this for weak collections would be good.

Then, the tool should add up the inbound strong references from both virtual-object state and collections, and compare that count against the rc entry. If they don't match, we've got a bug (perhaps #8756, perhaps a new one).

It should also compare the es export status with the matching c-list entry for that vat. If es claims the exported vref is reachable, the c-list should say R. If es says merely-recognizable, it should say _.

Finally, it should check the recognizability records for consistency. Vrefs that are used as keys in weak collections will appear both in the collection entry rows (as an ordinal), and there will be an ir record mapping from the vref to the collection ID. If we see one without the other, we've got some other kind of bug.

@warner warner added SwingSet package: SwingSet liveslots requires vat-upgrade to deploy changes labels Jan 17, 2024
@mhofman
Copy link
Member

mhofman commented Jan 17, 2024

If we're gonna do this, could we not also do a full trace to detect orphans and unreferenced cycles?

@warner
Copy link
Member Author

warner commented Jan 17, 2024

My immediate need is to tell whether #8756 needs remediation, or if we can simply deploy a fix and not worry about cleaning up old data, and that just need to look for refcount discrepancies.

This tool would certainly be the basis for a full mark-and-sweep trace, but I'd consider it extra credit, and I'd like whoever implements this to answer the remediation question before doing the extra work.

warner added a commit that referenced this issue Aug 28, 2024
Liveslots was suffering from the following bugs:

* vrefs weren't properly encoded into DB keys and "encodePassable"
  keys when deleting collections (such that `isEncodedRemotable()`
  always answered false)

* recognition records (vom.ir. keys) weren't created or removed for
  Remotable-style keys in weak collections

* Remotable-style vrefs were not submitted for retirement processing
  when deleting a weak collection

As a result:

* Invoking the `.clear()` method on a virtual/durable collection, or
  dereferencing the collection itself (and allowing it to be garbage
  collected), did not free any Remotables, virtual/durable
  objects (Representatives), or imported Presences used as keys. This
  could cause data in other weak collections (or other vats) to be
  retained longer than necessary.

* Allowing a weak virtual/durable collection to be garbage collected
  did not inform the kernel that the vat can no longer recognize the
  Remotable/Representative/Presence-style keys, consuming c-list
  entries longer than necessary.

* `collection.clear()` left the "ordinal-assignment"` records in the
  vatstore, causing subsequent .has() to incorrectly return true, and
  .init() to throw a "already registered" error, as well as consuming
  DB space longer than necessary.

* retiring (deleting) a Remotable used as a key in a weak
  virtual/durable collection did not free the corresponding value,
  causing data (in local and/or remote vats) to be retained longer
  than necessary

This commit fixes those bugs, which fixes the immediate cause of
issues #8756 , #7355 , and #9956 . As a change to liveslots, full
deployment requires both restarting the kernel with this new
code, *and* triggering a vat upgrade of all vats. Once that is done,
no new collection entries will suffer the problems listed above.

However, this commit makes no attempt to remediate any existing data
corruption. See #8759 for plans to build a tool that can audit the
virtual-data reference graph and detect the consequences of these bugs
in pre-existing kernel databases, and see the issues listed above for
notes on efforts to build remediation tools.
warner added a commit that referenced this issue Aug 28, 2024
Liveslots was suffering from the following bugs:

* vrefs weren't properly encoded into DB keys and "encodePassable"
  keys when deleting collections (such that `isEncodedRemotable()`
  always answered false, which skipped the refcount processing of
  keys, and removal of the "ordinal-assignment" record)

* recognition records (vom.ir. keys) weren't created or removed for
  Remotable-style keys in weak collections

* Presence-style vrefs were not submitted for retirement processing
  when deleting a weak collection

As a result:

* `collection.clear()` left the "ordinal-assignment"` records in the
  vatstore, causing subsequent .has(oldkey) to incorrectly return
  true, and .init(oldkey, newvalue) to throw a "already registered"
  error, as well as consuming DB space longer than necessary.

* Invoking the `.clear()` method on a virtual/durable collection, or
  dereferencing the collection itself (and allowing it to be garbage
  collected), did not free any Remotables, virtual/durable
  objects (Representatives), or imported Presences used as keys. This
  could cause data in other weak collections (or other vats) to be
  retained longer than necessary.

* retiring (deleting) a Remotable used as a key in a weak
  virtual/durable collection did not free the corresponding value,
  causing data (in local and/or remote vats) to be retained longer
  than necessary

* Allowing a weak virtual/durable collection to be garbage collected
  did not inform the kernel that the vat can no longer recognize the
  Presence-style keys, consuming c-list entries longer than necessary.

This commit fixes those bugs, which fixes the immediate cause of
issues #8756 , #7355 , and #9956 . As a change to liveslots, full
deployment requires both restarting the kernel with this new
code, *and* triggering a vat upgrade of all vats. Once that is done,
no new collection entries will suffer the problems listed above.

However, this commit makes no attempt to remediate any existing data
corruption. See #8759 for plans to build a tool that can audit the
virtual-data reference graph and detect the consequences of these bugs
in pre-existing kernel databases, and see the issues listed above for
notes on efforts to build remediation tools.
warner added a commit that referenced this issue Aug 29, 2024
Liveslots was suffering from the following bugs:

* vrefs weren't properly encoded into DB keys and "encodePassable"
  keys when deleting collections (such that `isEncodedRemotable()`
  always answered false, which skipped the refcount processing of
  keys, and removal of the "ordinal-assignment" record)

* recognition records (vom.ir. keys) weren't created or removed for
  Remotable-style keys in weak collections

* Presence-style vrefs were not submitted for retirement processing
  when deleting a weak collection

As a result:

* `collection.clear()` left the "ordinal-assignment"` records in the
  vatstore, causing subsequent .has(oldkey) to incorrectly return
  true, and .init(oldkey, newvalue) to throw a "already registered"
  error, as well as consuming DB space longer than necessary.

* Invoking the `.clear()` method on a virtual/durable collection, or
  dereferencing the collection itself (and allowing it to be garbage
  collected), did not free any Remotables, virtual/durable
  objects (Representatives), or imported Presences used as keys. This
  could cause data in other weak collections (or other vats) to be
  retained longer than necessary.

* retiring (deleting) a Remotable used as a key in a weak
  virtual/durable collection did not free the corresponding value,
  causing data (in local and/or remote vats) to be retained longer
  than necessary

* Allowing a weak virtual/durable collection to be garbage collected
  did not inform the kernel that the vat can no longer recognize the
  Presence-style keys, consuming c-list entries longer than necessary.

This commit fixes those bugs, which fixes the immediate cause of:

* fixes #8756
* fixes #7355
* fixes #9956

As a change to liveslots, full deployment requires both restarting the
kernel with this new code, *and* triggering a vat upgrade of all
vats. Once that is done, no new collection entries will suffer the
problems listed above.

However, this commit makes no attempt to remediate any existing data
corruption. See #8759 for plans to build a tool that can audit the
virtual-data reference graph and detect the consequences of these bugs
in pre-existing kernel databases, and see the issues listed above for
notes on efforts to build remediation tools.

This commit marks the new tests as expected to pass again. It adds one
new (failing) test to demonstrate the lack of remediation code.

Thanks to @gibson042 for recommendations.
warner added a commit that referenced this issue Aug 30, 2024
Liveslots was suffering from the following bugs:

* vrefs weren't properly encoded into DB keys and "encodePassable"
  keys when deleting collections (such that `isEncodedRemotable()`
  always answered false, which skipped the refcount processing of
  keys, and removal of the "ordinal-assignment" record)

* recognition records (vom.ir. keys) weren't created or removed for
  Remotable-style keys in weak collections

* Presence-style vrefs were not submitted for retirement processing
  when deleting a weak collection

As a result:

* `collection.clear()` left the "ordinal-assignment"` records in the
  vatstore, causing subsequent .has(oldkey) to incorrectly return
  true, and .init(oldkey, newvalue) to throw a "already registered"
  error, as well as consuming DB space longer than necessary.

* Invoking the `.clear()` method on a virtual/durable collection, or
  dereferencing the collection itself (and allowing it to be garbage
  collected), did not free any Remotables, virtual/durable
  objects (Representatives), or imported Presences used as keys. This
  could cause data in other weak collections (or other vats) to be
  retained longer than necessary.

* retiring (deleting) a Remotable used as a key in a weak
  virtual/durable collection did not free the corresponding value,
  causing data (in local and/or remote vats) to be retained longer
  than necessary

* Allowing a weak virtual/durable collection to be garbage collected
  did not inform the kernel that the vat can no longer recognize the
  Presence-style keys, consuming c-list entries longer than necessary.

This commit fixes those bugs, which fixes the immediate cause of:

* fixes #8756
* fixes #7355
* fixes #9956

As a change to liveslots, full deployment requires both restarting the
kernel with this new code, *and* triggering a vat upgrade of all
vats. Once that is done, no new collection entries will suffer the
problems listed above.

However, this commit makes no attempt to remediate any existing data
corruption. See #8759 for plans to build a tool that can audit the
virtual-data reference graph and detect the consequences of these bugs
in pre-existing kernel databases, and see the issues listed above for
notes on efforts to build remediation tools.

This commit marks the new tests as expected to pass again. It adds one
new (failing) test to demonstrate the lack of remediation code.

Thanks to @gibson042 for recommendations.
warner added a commit that referenced this issue Aug 30, 2024
Liveslots was suffering from the following bugs:

* vrefs weren't properly encoded into DB keys and "encodePassable"
  keys when deleting collections (such that `isEncodedRemotable()`
  always answered false, which skipped the refcount processing of
  keys, and removal of the "ordinal-assignment" record)

* recognition records (vom.ir. keys) weren't created or removed for
  Remotable-style keys in weak collections

* Presence-style vrefs were not submitted for retirement processing
  when deleting a weak collection

As a result:

* `collection.clear()` left the "ordinal-assignment"` records in the
  vatstore, causing subsequent .has(oldkey) to incorrectly return
  true, and .init(oldkey, newvalue) to throw a "already registered"
  error, as well as consuming DB space longer than necessary.

* Invoking the `.clear()` method on a virtual/durable collection, or
  dereferencing the collection itself (and allowing it to be garbage
  collected), did not free any Remotables, virtual/durable
  objects (Representatives), or imported Presences used as keys. This
  could cause data in other weak collections (or other vats) to be
  retained longer than necessary.

* retiring (deleting) a Remotable used as a key in a weak
  virtual/durable collection did not free the corresponding value,
  causing data (in local and/or remote vats) to be retained longer
  than necessary

* Allowing a weak virtual/durable collection to be garbage collected
  did not inform the kernel that the vat can no longer recognize the
  Presence-style keys, consuming c-list entries longer than necessary.

This commit fixes those bugs, which fixes the immediate cause of:

* fixes #8756
* fixes #7355
* fixes #9956

As a change to liveslots, full deployment requires both restarting the
kernel with this new code, *and* triggering a vat upgrade of all
vats. Once that is done, no new collection entries will suffer the
problems listed above.

However, this commit makes no attempt to remediate any existing data
corruption. See #8759 for plans to build a tool that can audit the
virtual-data reference graph and detect the consequences of these bugs
in pre-existing kernel databases, and see the issues listed above for
notes on efforts to build remediation tools.

This commit marks the new tests as expected to pass again. It adds one
new (failing) test to demonstrate the lack of remediation code.

Thanks to @gibson042 for recommendations.
warner added a commit that referenced this issue Aug 31, 2024
Liveslots was suffering from the following bugs:

* vrefs weren't properly encoded into DB keys and "encodePassable"
  keys when deleting collections (such that `isEncodedRemotable()`
  always answered false, which skipped the refcount processing of
  keys, and removal of the "ordinal-assignment" record)

* recognition records (vom.ir. keys) weren't created or removed for
  Remotable-style keys in weak collections

* Presence-style vrefs were not submitted for retirement processing
  when deleting a weak collection

As a result:

* `collection.clear()` left the "ordinal-assignment"` records in the
  vatstore, causing subsequent .has(oldkey) to incorrectly return
  true, and .init(oldkey, newvalue) to throw a "already registered"
  error, as well as consuming DB space longer than necessary.

* Invoking the `.clear()` method on a virtual/durable collection, or
  dereferencing the collection itself (and allowing it to be garbage
  collected), did not free any Remotables, virtual/durable
  objects (Representatives), or imported Presences used as keys. This
  could cause data in other weak collections (or other vats) to be
  retained longer than necessary.

* retiring (deleting) a Remotable used as a key in a weak
  virtual/durable collection did not free the corresponding value,
  causing data (in local and/or remote vats) to be retained longer
  than necessary

* Allowing a weak virtual/durable collection to be garbage collected
  did not inform the kernel that the vat can no longer recognize the
  Presence-style keys, consuming c-list entries longer than necessary.

This commit fixes those bugs, which fixes the immediate cause of:

* fixes #8756
* fixes #7355
* fixes #9956

As a change to liveslots, full deployment requires both restarting the
kernel with this new code, *and* triggering a vat upgrade of all
vats. Once that is done, no new collection entries will suffer the
problems listed above.

However, this commit makes no attempt to remediate any existing data
corruption. See #8759 for plans to build a tool that can audit the
virtual-data reference graph and detect the consequences of these bugs
in pre-existing kernel databases, and see the issues listed above for
notes on efforts to build remediation tools.

This commit marks the new tests as expected to pass again. It adds one
new (failing) test to demonstrate the lack of remediation code.

Thanks to @mhofman and @gibson042 for recommendations.
warner added a commit that referenced this issue Aug 31, 2024
Liveslots was suffering from the following bugs:

* vrefs weren't properly encoded into DB keys and "encodePassable"
  keys when deleting collections (such that `isEncodedRemotable()`
  always answered false, which skipped the refcount processing of
  keys, and removal of the "ordinal-assignment" record)

* recognition records (vom.ir. keys) weren't created or removed for
  Remotable-style keys in weak collections

* Presence-style vrefs were not submitted for retirement processing
  when deleting a weak collection

As a result:

* `collection.clear()` left the "ordinal-assignment"` records in the
  vatstore, causing subsequent .has(oldkey) to incorrectly return
  true, and .init(oldkey, newvalue) to throw a "already registered"
  error, as well as consuming DB space longer than necessary.

* Invoking the `.clear()` method on a virtual/durable collection, or
  dereferencing the collection itself (and allowing it to be garbage
  collected), did not free any Remotables, virtual/durable
  objects (Representatives), or imported Presences used as keys. This
  could cause data in other weak collections (or other vats) to be
  retained longer than necessary.

* retiring (deleting) a Remotable used as a key in a weak
  virtual/durable collection did not free the corresponding value,
  causing data (in local and/or remote vats) to be retained longer
  than necessary

* Allowing a weak virtual/durable collection to be garbage collected
  did not inform the kernel that the vat can no longer recognize the
  Presence-style keys, consuming c-list entries longer than necessary.

This commit fixes those bugs, which fixes the immediate cause of:

* fixes #8756
* fixes #7355
* fixes #9956

As a change to liveslots, full deployment requires both restarting the
kernel with this new code, *and* triggering a vat upgrade of all
vats. Once that is done, no new collection entries will suffer the
problems listed above.

However, this commit makes no attempt to remediate any existing data
corruption. See #8759 for plans to build a tool that can audit the
virtual-data reference graph and detect the consequences of these bugs
in pre-existing kernel databases, and see the issues listed above for
notes on efforts to build remediation tools.

This commit marks the new tests as expected to pass again. It adds one
new (failing) test to demonstrate the lack of remediation code.

Thanks to @mhofman and @gibson042 for recommendations.
mergify bot added a commit that referenced this issue Aug 31, 2024
Liveslots was suffering from the following bugs:

* vrefs weren't properly encoded into DB keys and "encodePassable"
  keys when deleting collections (such that `isEncodedRemotable()`
  always answered false, which skipped the refcount processing of
  keys, and removal of the "ordinal-assignment" record)

* recognition records (vom.ir. keys) weren't created or removed for
  Remotable-style keys in weak collections

* Presence-style vrefs were not submitted for retirement processing
  when deleting a weak collection

As a result:

* `collection.clear()` left the "ordinal-assignment"` records in the
  vatstore, causing subsequent .has(oldkey) to incorrectly return
  true, and .init(oldkey, newvalue) to throw a "already registered"
  error, as well as consuming DB space longer than necessary.

* Invoking the `.clear()` method on a virtual/durable collection, or
  dereferencing the collection itself (and allowing it to be garbage
  collected), did not free any Remotables, virtual/durable
  objects (Representatives), or imported Presences used as keys. This
  could cause data in other weak collections (or other vats) to be
  retained longer than necessary.

* retiring (deleting) a Remotable used as a key in a weak
  virtual/durable collection did not free the corresponding value,
  causing data (in local and/or remote vats) to be retained longer
  than necessary

* Allowing a weak virtual/durable collection to be garbage collected
  did not inform the kernel that the vat can no longer recognize the
  Presence-style keys, consuming c-list entries longer than necessary.

This commit fixes those bugs, which fixes the immediate cause of
issues #8756 , #7355 , and #9956 . As a change to liveslots, full
deployment requires both restarting the kernel with this new
code, *and* triggering a vat upgrade of all vats. Once that is done,
no new collection entries will suffer the problems listed above.

However, this commit makes no attempt to remediate any existing data
corruption. See #8759 for plans to build a tool that can audit the
virtual-data reference graph and detect the consequences of these bugs
in pre-existing kernel databases, and see the issues listed above for
notes on efforts to build remediation tools.
@warner
Copy link
Member Author

warner commented Sep 3, 2024

Talking with @siarhei-agoric , we decided that many modes of this tool do not really need a separate intermediate DB. For example, auditing for refcount consistency within a single vat can do:

  • walk all virtual-collection and virtual-object rows of the vatstore
  • accumulate refcounts in a RAM-based JS Map
  • once the walk is complete, iterate through the Map and check the refcount for each
  • also walk the refcount section of the vatstore and check map.has() for each, to look for stranded refcount rows

That requires RAM in proportion to the number of vrefs in use by the vat, perhaps 1-10M for the larger ones, but does not require storing full details about each edge.

If the tool detects a mismatch, we could re-run it in a verbose mode, with a specific set of vrefs to look for (repeating the full walk to gather more data), but with luck it will report "all refcounts match" and we don't need to spend the extra effort.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
liveslots requires vat-upgrade to deploy changes SwingSet package: SwingSet
Projects
None yet
Development

No branches or pull requests

3 participants