# bundleStore: break bundles into deduplicated modules #9522
Labels: enhancement, performance, swing-store, SwingSet
## What is the Problem Being Solved?
#9517 describes a desire to improve the "install contract bundle to chain" pathway. Currently we install each bundle in a single (signed) cosmos message containing its entire contents, even though most new bundles have significant overlap with bundles that already exist on the chain (e.g. most of `@endo`, ERTP, Nat, and a lot of support code). We could shrink the cosmos messages considerably if we could install only the portions that are not already on the chain.

Bundles consist of a top-level "compartment map" (a medium-size JSON-serialized record) and a set of "modules" (each with a name, and containing JS code). The compartment map contains hashes of the module contents. The hash of the compartment map is used to build the BundleID.

The module hashes give us an easy way to de-duplicate their contents. Instead of sending the entire bundle to the chain, we could send only the modules that are not already installed, plus the new compartment map.
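For illustration, here is a heavily abbreviated sketch (as a JS object) of the shape of a compartment map; the real `compartment-map.json` format carries more fields, and the hash shown is a truncated placeholder:

```js
// Abbreviated, illustrative compartment-map shape (not the full format).
const compartmentMap = {
  entry: { compartment: 'my-contract-v1.0.0', module: './index.js' },
  compartments: {
    'my-contract-v1.0.0': {
      modules: {
        './index.js': { location: 'index.js', sha512: '9f3c…' }, // the ModuleID
      },
    },
  },
};
// The BundleID is 'b1-' plus the SHA-512 hash of the whole
// compartment-map.json, which in turn commits to every module hash above.
```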
This will require support from swing-store (to track the modules and compartment maps), the kernel (to provide methods that query and install modules and compartment maps), cosmic-swingset (to accept installation messages, and to expose module-is-present query mechanisms), and agd (to add a new `agd publish-bundle` command that interacts with the chain to figure out which modules need to be installed, and creates the necessary signed messages).

This ticket is focused on the swing-store changes.
## Description of the Design
Currently, swing-store contains a `bundleStore` component, which manages a single table named `bundles`. This is a simple map from `BundleID` (e.g. `b1-039c67a6e86acfc64c3d8bce9a8a824d5674eda18680f8ecf071c59724798945b086c4bac2a6065bed7f79ca6ccc23e8f4a464dfe18f2e8eaa9719327107f15b`) to the bundle contents. Specifically, it contains the `.zip`-formatted data, which is Base64-encoded to supply the `.endoZipBase64` property of the bundle object, whose `moduleFormat` property is `"endoZipBase64"`. The bundleStore's `.getBundle(id)` method returns a bundle object like `{ moduleFormat, endoZipBase64, endoZipBase64Sha512 }`. (The bundleStore also supports older/simpler formats, but those cannot be broken down into modules very conveniently, and are only used for bootstrap purposes, so we ignore them here.)

The plan is to perform an #8089-type SQLite schema upgrade, which adds two new tables, `modules` and `compartmentMaps`. The `modules` table is indexed by a ModuleID (a hex-encoded SHA-512 hash of the module contents, exactly as used in `compartment-map.json`) and contains the compressed module contents (JS code). The `compartmentMaps` table contains the BundleID (a hex-encoded SHA-512 hash of `compartment-map.json`), along with the compressed contents of the compartment map. We might want an additional table to track the (many-to-many) ownership relationship between bundles and modules, and perhaps some refcount columns, so we can efficiently delete unused modules in the future (a schema sketch follows).
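A minimal sketch of what the new schema might look like, assuming better-sqlite3-style access; the column names and the `bundleModules` ownership table are illustrative assumptions, not a final design:

```js
// Hypothetical DDL for the new tables (names are illustrative, not final).
db.exec(`
  CREATE TABLE modules (
    moduleID TEXT PRIMARY KEY,   -- hex SHA-512 of the module contents
    contents BLOB NOT NULL       -- compressed JS module source
  );
  CREATE TABLE compartmentMaps (
    bundleID TEXT PRIMARY KEY,   -- 'b1-' + hex SHA-512 of compartment-map.json
    compartmentMap BLOB NOT NULL -- compressed compartment-map.json
  );
  -- possible many-to-many ownership table, enabling refcounts/GC
  CREATE TABLE bundleModules (
    bundleID TEXT NOT NULL,
    moduleID TEXT NOT NULL,
    PRIMARY KEY (bundleID, moduleID)
  );
`);
```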
The existing `bundleStore.addBundle()` will still accept full bundles. When given a `b1-` format bundle, it will (see the sketch after this list):

- unzip the bundle and extract the compartment map and the module contents
- for each module not already present in `modules`, do an INSERT
- INSERT a `compartmentMaps` row
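A minimal sketch of that decomposition, assuming better-sqlite3 prepared statements; `unzipBundle` is a hypothetical helper, and compression is elided for clarity:

```js
import { createHash } from 'node:crypto';

function addBundle(db, bundleID, bundle) {
  const zipBytes = Buffer.from(bundle.endoZipBase64, 'base64');
  // hypothetical: split the zip into compartment-map.json + module files
  const { compartmentMapBytes, moduleContents } = unzipBundle(zipBytes);
  const insertModule = db.prepare(
    'INSERT OR IGNORE INTO modules (moduleID, contents) VALUES (?, ?)',
  );
  for (const contents of moduleContents) {
    const moduleID = createHash('sha512').update(contents).digest('hex');
    insertModule.run(moduleID, contents); // no-op if already deduplicated
  }
  db.prepare(
    'INSERT INTO compartmentMaps (bundleID, compartmentMap) VALUES (?, ?)',
  ).run(bundleID, compartmentMapBytes);
}
```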
The existing `bundleStore.getBundle()` API will still return full bundles. When called, it will (see the sketch after this list):

- fetch the compartment map and its module contents from the new tables
- use the `@endo/bundle-source` zipfile generator to create a bundle ArrayBuffer
- Base64-encode that into `endoZipBase64`
- return the bundle object (`{ moduleFormat, endoZipBase64, endoZipBase64Sha512 }`)
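A matching sketch of the re-synthesis; `modulesOf` (a walk over the compartment map's module records) and `writeZip` (a deterministic zip writer matching the `@endo` layout) are hypothetical stand-ins:

```js
function getBundle(db, bundleID) {
  const row = db
    .prepare('SELECT compartmentMap FROM compartmentMaps WHERE bundleID = ?')
    .get(bundleID);
  if (!row) throw Error(`unknown bundleID ${bundleID}`);
  const getModule = db.prepare('SELECT contents FROM modules WHERE moduleID = ?');
  const files = [['compartment-map.json', row.compartmentMap]];
  for (const { location, sha512 } of modulesOf(row.compartmentMap)) {
    files.push([location, getModule.get(sha512).contents]);
  }
  const zipBytes = writeZip(files); // must byte-match the original archive
  return {
    moduleFormat: 'endoZipBase64',
    endoZipBase64: Buffer.from(zipBytes).toString('base64'),
    // per the BundleID construction above, the ID embeds the hash
    endoZipBase64Sha512: bundleID.slice('b1-'.length),
  };
}
```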
Some new APIs will be added (both sketched below):

- `bundleStore.addModule(moduleID, moduleContents)`: assert that the hash is correct, then INSERT a row into `modules`
- `bundleStore.addCompartmentMap(bundleID, compartmentMap)`: assert that the hash is correct, enumerate the required ModuleIDs, assert that all of them are present in `modules`, then INSERT a row into `compartmentMaps`
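A sketch of those two APIs, under the same assumptions as above (`modulesOf` remains a hypothetical compartment-map walk):

```js
import { createHash } from 'node:crypto';

function addModule(db, moduleID, moduleContents) {
  const actual = createHash('sha512').update(moduleContents).digest('hex');
  if (actual !== moduleID) throw Error(`hash mismatch for module ${moduleID}`);
  db.prepare('INSERT OR IGNORE INTO modules (moduleID, contents) VALUES (?, ?)')
    .run(moduleID, moduleContents);
}

function addCompartmentMap(db, bundleID, compartmentMap) {
  const hash = createHash('sha512').update(compartmentMap).digest('hex');
  if (`b1-${hash}` !== bundleID) throw Error(`hash mismatch for ${bundleID}`);
  // refuse incomplete bundles: every required module must already be present
  const has = db.prepare('SELECT 1 FROM modules WHERE moduleID = ?');
  for (const { sha512 } of modulesOf(compartmentMap)) {
    if (!has.get(sha512)) throw Error(`missing module ${sha512}`);
  }
  db.prepare(
    'INSERT INTO compartmentMaps (bundleID, compartmentMap) VALUES (?, ?)',
  ).run(bundleID, compartmentMap);
}
```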
The existing `bundleStore.deleteBundle()` API will still work; however, it is only guaranteed to delete the compartment map. Modules cannot be deleted while any remaining bundle still uses them. We can build a SQL query to learn the list of module IDs which no longer have owners, and delete those (see the sketch below), but we should avoid accidentally deleting modules that are being added in anticipation of an upcoming `addCompartmentMap` call.
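A sketch of the orphan-collection query, assuming the hypothetical `bundleModules` ownership table from the schema sketch above; the grace period for pre-staged modules is left out:

```js
// Delete modules that no surviving bundle references.
db.exec(`
  DELETE FROM modules
   WHERE moduleID NOT IN (SELECT DISTINCT moduleID FROM bundleModules)
`);
```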
## Security Considerations
Modules and compartment maps are hashed just as thoroughly as before; in fact, the new tables are stricter, because they will reject corrupted modules.

If a second-preimage attack were found against SHA-512, an attacker could pre-submit an evil module variant that matches the hash of the good variant, and the deduplicated upload process would prefer the attacker's version, effectively allowing an attacker to corrupt other users' contract/vat bundles. However, we are already vulnerable to such an attack on the BundleIDs (compartment-map contents), so there is no new threat here.

Size limits or installation fees should still be applied to the individual modules, to ensure this does not open up new opportunities for abuse.
It would be great to apply a filter on `addModule()`, to enforce that the module contents are legal JavaScript code, to limit abuse.
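One cheap way to approximate such a filter is to reject anything a real JavaScript parser rejects; using `acorn` here is an illustrative assumption, not a chosen dependency:

```js
import { parse } from 'acorn';

// Reject module contents that do not parse as JavaScript.
function assertLegalJs(moduleContents) {
  try {
    parse(moduleContents.toString('utf8'), { ecmaVersion: 'latest' });
  } catch (err) {
    throw Error(`module is not legal JavaScript: ${err.message}`);
  }
}
```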
## Performance / Scaling Considerations

The current design does not compress anything about bundles or modules. We use a narrow subset of the `zip` format, which disables all compression: the bytes produced by `base64decode(bundle.endoZipBase64)` contain the `zip` header followed by the literal bytes of `compartment-map.json` and the modules.

The deduplication of modules provides one form of disk-space savings; the compression of modules (and compartment maps) provides another. We probably won't get as much compression savings as we would by compressing entire bundles, because the compression context is limited to one file at a time, but we expect the deduplication to make up for this.
Installing bundles/modules will be slower, because the bundleStore is doing more work and is more aware of the format. However, installation doesn't happen very frequently.
`getBundle` will be slower, because it needs to re-synthesize the bundle contents out of the individual modules. We might want to add a small LRU cache of synthesized bundles, to improve performance for things like ZCF (which are retrieved more frequently). Also, installing a new compartment map is likely to be followed shortly by retrieving the bundle, so we might count installation as a "use" in the LRU cache. This cache might be persistent (kept in an additional SQLite table); however, the cache is not part of consensus.
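A minimal in-memory LRU sketch (the size and eviction policy are assumptions); a persistent variant would mirror this in an extra SQLite table:

```js
const MAX_CACHED_BUNDLES = 10; // arbitrary illustrative size
const bundleCache = new Map(); // Map iteration order doubles as recency order

function rememberBundle(bundleID, bundle) {
  bundleCache.delete(bundleID); // re-insert to mark as most recently used
  bundleCache.set(bundleID, bundle);
  if (bundleCache.size > MAX_CACHED_BUNDLES) {
    const oldest = bundleCache.keys().next().value;
    bundleCache.delete(oldest); // evict the least recently used entry
  }
}

function lookupBundle(bundleID) {
  const bundle = bundleCache.get(bundleID);
  if (bundle !== undefined) rememberBundle(bundleID, bundle); // refresh recency
  return bundle; // undefined on miss; caller falls back to re-synthesis
}
```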
## Test Plan

Unit tests in `packages/swing-store`.

## Upgrade
The #8089 swing-store upgrade handler will need to create the new tables and transform the `bundleStore` table. Every `b1-` format bundle from the old `bundles` table would be translated into modules and compartment maps, stored in the new tables, and removed from the old one (assuming we make `deleteBundle` look for bundles in the old table too, at least during upgrade). This would probably look like the sketch below.
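A sketch of that upgrade pass; the old table's column names and serialization are assumptions, and `addBundle` is the decomposing version sketched earlier:

```js
// Translate every b1- bundle into the new tables, then drop the old row.
const oldRows = db
  .prepare(`SELECT bundleID, bundle FROM bundles WHERE bundleID LIKE 'b1-%'`)
  .all();
const deleteOld = db.prepare('DELETE FROM bundles WHERE bundleID = ?');
for (const { bundleID, bundle } of oldRows) {
  // assumes the old column holds the raw zip bytes (serialization is a guess)
  const endoZipBase64 = Buffer.from(bundle).toString('base64');
  addBundle(db, bundleID, { moduleFormat: 'endoZipBase64', endoZipBase64 });
  deleteOld.run(bundleID);
}
```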
## Open Questions

The overall process might work in one of three ways, depending upon when the compartment map is installed relative to its modules: along with them in a single full-bundle message (as today), before them, or after them.
We want to minimize failed chain transactions, so we think it is important to be able to query the chain for installed modules first (rather than spending a whole signed transaction, plus fees, to ask that question). That suggests we need an RPC query which can reveal the modules that are installed, or one which accepts a list of module IDs and reports back the ones which are missing, or one which returns a Bloom filter that compresses the list of installed module IDs.
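The middle option might be as simple as the following sketch (a hypothetical query-side helper, not a designed RPC surface):

```js
// Given candidate ModuleIDs, report back the ones not yet installed.
function missingModules(db, moduleIDs) {
  const has = db.prepare('SELECT 1 FROM modules WHERE moduleID = ?');
  return moduleIDs.filter(moduleID => !has.get(moduleID));
}
```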
Approach B requires the `bundleStore` to have the notion of an "incomplete" bundle. The kernel's vat-admin service has something resembling this (you can ask it for a Promise that resolves only once a given bundle becomes available), but I think it would be tricky to extend that to track missing modules. So I want to use approach C, where the modules arrive first and `addCompartmentMap()` completes the bundle only once all of them are present.

We'll need to design this API with the kernel's APIs and the cosmic-swingset/chain-level APIs in mind.