
bundleStore: break bundles into deduplicated modules #9522

Open
1 task
warner opened this issue Jun 17, 2024 · 1 comment
Labels
enhancement New feature or request performance Performance related issues swing-store SwingSet package: SwingSet

Comments


warner commented Jun 17, 2024

What is the Problem Being Solved?

#9517 describes a desire to improve the "install contract bundle to chain" pathway. Currently we install each bundle in a single (signed) cosmos message, containing its entire contents, even though most new bundles have significant overlap with bundles that already exist on the chain (eg most of @endo, ERTP, Nat, a lot of support code). We could shrink the cosmos messages considerably if we could install only the portions that were not already on the chain.

Bundles consist of a top-level "compartment map" (a medium-size JSON-serialized record), and a set of "modules" (each with a name, and containing JS code). The compartment map contains hashes of the module contents. The hash of the compartment map is used to build the BundleID.

The module hashes give us an easy way to deduplicate their contents. Instead of sending the entire bundle to the chain, we could send only the modules that are not already installed, plus the new compartment map.

This will require support from swing-store (to track the modules and compartment maps), the kernel (to provide methods that query and install modules and compartment maps), cosmic-swingset (to accept installation messages, and expose module-is-present query mechanisms), and agd (to add a new agd publish-bundle command that interacts with the chain to figure out what modules need to be installed, and create the necessary signed messages).

This ticket is focused on the swing-store changes.

Description of the Design

Currently, swing-store contains a bundleStore component, which manages a single table named bundles. This is a simple map from BundleID (eg b1-039c67a6e86acfc64c3d8bce9a8a824d5674eda18680f8ecf071c59724798945b086c4bac2a6065bed7f79ca6ccc23e8f4a464dfe18f2e8eaa9719327107f15b) to the bundle contents. Specifically, it contains the .zip-formatted data which is Base64-encoded to supply the .endoZipBase64 property of the bundle object, whose moduleFormat property is "endoZipBase64". The bundleStore's .getBundle(id) method returns a bundle object like { moduleFormat, endoZipBase64, endoZipBase64Sha512 }. (bundleStore also supports older/simpler formats, but those cannot be broken down into modules very conveniently, and are only used for bootstrap purposes, so we ignore them.)

The plan is to perform an #8089 -type SQLite schema upgrade, which adds two new tables, modules and compartmentMaps. The modules table is indexed by a ModuleID (a hex-encoded SHA512 hash of the module contents, exactly as used in compartment-map.json), and contains the compressed module contents (JS code). The compartmentMaps table contains the BundleID (hex-encoded SHA512 hash of compartment-map.json), along with the compressed contents of the compartment map. We might want an additional table to track the (many-to-many) ownership relationships between the two, and perhaps some refcount columns, so we can efficiently delete unused modules in the future.
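As a rough illustration, the new tables might be declared like the following sketch. The table and column names here are illustrative, not a final design; in particular the bundleModules ownership table is the optional many-to-many tracking mentioned above.

```javascript
// Hypothetical schema sketch; names and types are illustrative only.
const SCHEMA = `
  CREATE TABLE IF NOT EXISTS modules (
    moduleID TEXT PRIMARY KEY,  -- hex SHA-512 of the module contents
    moduleBytes BLOB NOT NULL   -- compressed JS source
  );
  CREATE TABLE IF NOT EXISTS compartmentMaps (
    bundleID TEXT PRIMARY KEY,  -- hex SHA-512 of compartment-map.json
    mapBytes BLOB NOT NULL      -- compressed compartment-map.json
  );
  -- optional many-to-many ownership, enabling orphan-module cleanup
  CREATE TABLE IF NOT EXISTS bundleModules (
    bundleID TEXT NOT NULL,
    moduleID TEXT NOT NULL,
    PRIMARY KEY (bundleID, moduleID)
  );
`;
```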

The existing bundleStore.addBundle() will still accept full bundles. When given a b1- format bundle, it will:

  • assert the compartment map hash is correct
  • iterate through the compartment map, examining each module in turn
    • assert the module hash is correct
    • if the ModuleID is not already present in modules, do an INSERT
  • INSERT the compartmentMaps row

The existing bundleStore.getBundle() API will still return full bundles. When called, it will:

  • fetch and decompress the compartment map, indexed by the BundleID
  • parse the compartment map to learn the list of ModuleIDs
  • fetch and decompress each module
  • use the @endo/bundle-source zipfile generator to create a bundle ArrayBuffer
  • base64-encode that to create endoZipBase64
  • build and return the bundle object ({ moduleFormat, endoZipBase64, endoZipBase64Sha512 })

Some new APIs will be added:

  • bundleStore.addModule(moduleID, moduleContents): assert the hash is correct, then INSERT a row into modules
  • bundleStore.addCompartmentMap(bundleID, compartmentMap): assert the hash is correct, enumerate the required ModuleIDs, assert that all are present in modules, then INSERT a row into compartmentMaps
  • something to reveal the list of available modules, or to ask about what modules would be required to install a given compartment map
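That last query might take a shape like the following sketch: given a parsed compartment map and the set of ModuleIDs already in the modules table, return the IDs that still need an addModule() call. The function name and compartmentMap shape are illustrative.

```javascript
// Hypothetical "what would I still need to install?" helper.
function missingModules(compartmentMap, storedModuleIDs) {
  const needed = new Set(
    Object.values(compartmentMap.modules).map(({ sha512 }) => sha512),
  );
  return [...needed].filter((id) => !storedModuleIDs.has(id));
}
```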

The existing bundleStore.deleteBundle() API will still work; however, it is only guaranteed to delete the compartment map. Modules cannot be deleted while any remaining bundle still uses them. We can build a SQL query to learn which module IDs no longer have owners and delete them; however, we should avoid accidentally deleting modules that are being added in anticipation of an upcoming addCompartmentMap call.
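In SQL terms this could be something like `DELETE FROM modules WHERE moduleID NOT IN (SELECT moduleID FROM bundleModules)`; the same computation, sketched over plain JS data structures (names hypothetical):

```javascript
// ownership: Map from bundleID -> Set of ModuleIDs that bundle uses.
// Returns the ModuleIDs that no remaining bundle references.
function orphanedModules(allModuleIDs, ownership) {
  const referenced = new Set();
  for (const ids of ownership.values()) {
    for (const id of ids) referenced.add(id);
  }
  return [...allModuleIDs].filter((id) => !referenced.has(id));
}
```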

Security Considerations

Modules and compartment maps are hashed just as thoroughly as before; in fact the new tables are stricter, because they will reject corrupted modules.

If a second-preimage attack were found against SHA-512, an attacker could pre-submit an evil module variant that matches the hash of the good variant, and the deduplicated upload process would prefer the attacker's version, effectively allowing an attacker to corrupt other users' contract/vat bundles. However we are already vulnerable to such an attack, on the BundleIDs (compartment map contents), so there is no new threat here.

Size limits or installation fees should still be applied on the individual modules, to ensure this does not open up new opportunities for abuse.

It would be great to apply a filter on addModule(), to enforce that the module contents are legal JavaScript code, to limit abuse.

Performance / Scaling Considerations

The current design does not compress anything about bundles or modules. We use a narrow subset of the zip format, which disables all compression. The bytes produced by base64decode(bundle.endoZipBase64) contain the zip header followed by the literal bytes of the compartment-map.json and the modules.

The deduplication of modules provides one form of disk-space savings. The compression of modules (and compartment maps) provides an additional form. We probably won't get as much compression savings as we would get by compressing entire bundles, because the compression context is limited to one file at a time, but we expect the deduplication to make up for this.

Installing bundles/modules will be slower, because the bundleStore is doing more work, and is more aware of the format. However installation doesn't happen very frequently.

getBundle will be slower, because it needs to re-synthesize the bundle contents out of the individual modules. We might want to add a small LRU cache of synthesized bundles, to improve performance for things like ZCF (which are retrieved more frequently). Also, installing a new compartment map is likely to be followed shortly by retrieving the bundle, so we might count installation as "use" in the LRU cache. This cache might be persistent (kept in an additional SQLite table), however the cache is not part of consensus.
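A minimal sketch of such a cache, exploiting Map's insertion-order iteration for least-recently-used eviction; the capacity and class name are illustrative, and a real version would decide whether addCompartmentMap() also counts as a "use".

```javascript
// Toy LRU cache of synthesized bundles, keyed by BundleID.
class BundleLRU {
  constructor(capacity = 10) {
    this.capacity = capacity;
    this.map = new Map(); // bundleID -> bundle object, oldest first
  }
  get(bundleID) {
    if (!this.map.has(bundleID)) return undefined;
    const bundle = this.map.get(bundleID);
    this.map.delete(bundleID); // re-insert to mark as most recently used
    this.map.set(bundleID, bundle);
    return bundle;
  }
  set(bundleID, bundle) {
    this.map.delete(bundleID);
    this.map.set(bundleID, bundle);
    if (this.map.size > this.capacity) {
      const oldest = this.map.keys().next().value;
      this.map.delete(oldest); // evict the least recently used entry
    }
  }
}
```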

Test Plan

Unit tests in packages/swing-store.

Upgrade

The #8089 swing-store upgrade handler will need to create the new tables, and transform the bundleStore table.

Every b1- format bundle from the old bundles table would be translated into modules and compartment maps, stored in the new tables, and removed from the old one. This would probably look like:

```js
for (const id of bundleStore.getBundleIDs()) {
  if (id.startsWith('b1-')) {
    const bundle = bundleStore.getBundle(id); // read from the old table
    bundleStore.deleteBundle(id); // remove the old-format row
    bundleStore.addBundle(id, bundle); // re-add through the module-aware path
  }
}
```

(assuming we make deleteBundle look for bundles in the old table too, at least during upgrade)

Open Questions

The overall process might work in one of three ways:

  • A:
    • try to install the compartment map, fail
    • the error tells you which modules are missing
    • install those modules
    • try to install the compartment map, succeed
    • now the bundle is ready to use
  • B:
    • install the compartment map, it works but the bundle is marked as incomplete, the result tells you which modules are missing
    • install those modules
    • now the bundle is ready to use
  • C:
    • query the store for the missing modules
    • install those modules
    • install the compartment map, succeed
    • now the bundle is ready to use

We want to minimize failed chain transactions, so we think it is important to be able to query the chain for installed modules first (and not spend a whole signed transaction, plus fees, to ask this question). That suggests that we need an RPC query which can reveal the modules that are installed, or one which accepts a list of module IDs and reports back the ones which are missing, or which returns a Bloom filter that compresses the list of installed module IDs.

Approach B requires the bundleStore to have the notion of an "incomplete" bundle. The kernel's vat-admin service has something resembling this (you can ask it for a Promise that is only resolved once a given bundle becomes available), but I think it would be tricky to build this to track missing modules. So I want to use approach C.

We'll need to design this API with the kernel's APIs and the cosmic-swingset/chain-level APIs in mind.

Tasks

  1. SwingSet enhancement swing-store

warner commented Jun 18, 2024

We should probably update the export format to include both modules and compartment maps. This would export each module as its own artifact, and each compartment map as its own artifact. The importer would want to import all the modules first, then import the compartment map.

The trickiest part is what we do with the export-data. The new format wants to have each module and compartment map get its own export-data records, but the pre-existing export-data has one record per bundle. Maybe we retain the bundle records (which just have the bundleID hash) and think of them as compartment-map records, and emit new records for the modules at upgrade time.

If we wanted to avoid this work, we might be able to get away with only exporting/importing complete bundles. Our exports would be larger (they would contain duplicate modules), but that's probably not too significant.

The #8089 discussions about how a schema version should be exposed in an import are quite relevant, even if we stick to exporting/importing complete bundles. In particular, if someone installs a module and then fails to install a compartment map that references it, and then the state is copied to a new node via export, the original and the copy would react differently to an installCompartmentMap() that references the hitherto-orphan module, if the export process didn't faithfully reproduce the individual modules.
