
Conversation

eqrion (Contributor) commented Aug 29, 2025

Partially fixes #154. A full fix would also target some JS files. Opening this to get early thoughts on it; there are some integration questions.

  1. JetStreamDriver.js is extended to decompress .z files using zlib during prefetch. If prefetching is disabled, these files are still prefetched so that the decompression time stays outside of the score. In the browser this uses DecompressionStream; in the shell it uses the zlib-wasm code to decompress the file (see the sketch after this list).
  2. A compress.py script is added that finds all wasm files, compresses them with zlib, and then removes the originals.
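
For the browser side, this can be done with the built-in DecompressionStream, since the .z files are zlib streams and the "deflate" format covers exactly that. A minimal sketch of the prefetch-time decompression step, assuming a hypothetical isCompressed() helper rather than the actual driver code:

async function fetchAndMaybeDecompress(url) {
    const response = await fetch(url);
    if (!isCompressed(url))
        return response.arrayBuffer();
    // Pipe the compressed body through the browser's built-in inflater ("deflate" means zlib here).
    const stream = response.body.pipeThrough(new DecompressionStream("deflate"));
    return new Response(stream).arrayBuffer();
}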

Open questions:

  1. Is it worth using something other than zlib? We need to support the shell, and I didn't want to vendor in a new library just for this.
  2. Should we keep all the original uncompressed files? This patch doesn't, but instead the compress.py script can automatically decompress all the files in the tree for anyone who wants to read the build artifacts.
  3. Should the compression happen in each individual build script or one central file for the repo? I sort of liked having it in one file because then I could implement automatic decompression for all builds easily. But it adds an extra step when building that might not be obvious.


netlify bot commented Aug 29, 2025

Deploy Preview for webkit-jetstream-preview ready!

🔨 Latest commit: c3dce11
🔍 Latest deploy log: https://app.netlify.com/projects/webkit-jetstream-preview/deploys/68b1d99af7234800085a66dd
😎 Deploy Preview: https://deploy-preview-170--webkit-jetstream-preview.netlify.app

camillobruni (Contributor) commented:

  • We should go with an npm run script for decompression, just for consistency with the rest.
  • I'd be fine with the .z files, given the easy way to get to the decompressed .wasm files.
  • Not sure how folks feel about --no-prefetch and wasm in this case (at least for JS I'd want to have the uncompressed source files there so I can easily see the source file path in the raw profile); maybe we need to warn about this and just require manually running npm run decompress.
  • I think you altered the code for &prefetchResources=false for JS blobs in the browser; we should keep using the raw sources there (see #149, "New Workload: prismjs source code highlighting", for an example).

danleh (Contributor) commented Sep 1, 2025

Very cool! I'll leave some detailed comments on the PR next, but responding first to your question 1:

Is it worth using something other than zlib?

No, I don't think that's worth the hassle. Some quick data / experiment: I copied all .wasm files plus this list of "large input files" (including some model files from #148)

./transformersjs/build/models/Xenova/distilbert-base-uncased-finetuned-sst-2-english/onnx/model_uint8.onnx
./transformersjs/build/models/Xenova/whisper-tiny.en/onnx/decoder_model_merged_quantized.onnx
./transformersjs/build/models/Xenova/whisper-tiny.en/onnx/encoder_model_quantized.onnx
./transformersjs/build/models/Xenova/whisper-tiny.en/tokenizer.json
./transformersjs/build/models/Xenova/distilbert-base-uncased-finetuned-sst-2-english/tokenizer.json
./wasm/tfjs-model-coco-ssd.js
./wasm/tfjs-model-mobilenet-v1.js
./wasm/tfjs-model-mobilenet-v3.js
./wasm/tfjs-bundle.js
./wasm/tfjs-model-use.js
./wasm/dotnet/build-interp/wwwroot/_framework/icudt_no_CJK.dat
./wasm/dotnet/build-aot/wwwroot/_framework/icudt_no_CJK.dat
./wasm/dotnet/build-interp/wwwroot/_framework/icudt_CJK.dat
./wasm/dotnet/build-aot/wwwroot/_framework/icudt_CJK.dat
./wasm/dotnet/build-interp/wwwroot/_framework/icudt_EFIGS.dat
./wasm/dotnet/build-aot/wwwroot/_framework/icudt_EFIGS.dat
./SeaMonster/inspector-json-payload.js
./code-load/inspector-payload-minified.js

and compared different compression methods:

| method                                              | size   | relative to uncompressed | relative to zlib |
|-----------------------------------------------------|--------|--------------------------|------------------|
| uncompressed                                        | 243MiB | 100%                     | 181%             |
| zlib (script from this PR, uses -6 by default IIUC) | 134MiB | 55%                      | 100%             |
| gzip -9                                             | 134MiB | 55%                      | 99.6%            |
| zstd -19                                            | 127MiB | 52%                      | 94.7%            |

I don't think those small savings from a better algorithm / library are worth adding another dependency for. (And we could also no longer use DecompressionStream in the browser, since that only seems to support DEFLATE and gzip [spec].)
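
(For reference, those levels map directly onto Node's built-in zlib module; the snippet below only illustrates the default level 6 vs. level 9 trade-off and is not the PR's actual script.)

import { deflateSync, constants } from "node:zlib";
import { readFileSync } from "node:fs";

// Compare zlib's default level (6) against the best setting (9) for a single file.
const input = readFileSync(process.argv[2]);
const atDefault = deflateSync(input);                                        // zlib default, level 6
const atBest = deflateSync(input, { level: constants.Z_BEST_COMPRESSION });  // level 9
console.log(`uncompressed: ${input.length}, level 6: ${atDefault.length}, level 9: ${atBest.length}`);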

danleh (Contributor) commented Sep 1, 2025

Regarding the other points (including Camillo's):

We should go with an npm run script for decompression, just for consistency with the rest.

+1 to staying in the JavaScript/npm ecosystem. Would be happy to provide a port / alternative to compress.py in JavaScript in a PR later today.

  1. Should we keep all the original uncompressed files?

One reason for this change was to make the repository smaller on disk (excluding .git/) for vendoring JetStream, so let's not keep the uncompressed files checked in. Also in particular for Wasm files or machine learning model weights, one cannot diff them conveniently anyway, e.g., when reviewing PRs here, so I don't see much value in keeping them. Having a simple script to uncompress sounds good enough.

  1. Should the compression happen in each individual build script or one central file for the repo?

I agree that it's convenient to have a single script to decompress everything (in particular given the next point by Camillo). But I would like the build scripts to be self-contained / a single step; otherwise I think it's easy to forget or at least annoying having to run another python3 compress.py (or npm run compress) command after each build, e.g., when updating a workload. The compress command could take a list of files as input (including glob patterns), e.g., npm compress **/*.{wasm,dat} in the build script, and npm decompress could use **/*.z as the pattern by default.

Not sure how folks feel about --no-prefetch and wasm in this case (at least for JS I'd want to have the uncompressed source files there so I can easily see the source file path in the raw profile); maybe we need to warn about this and just require manually running npm run decompress.

Agreed; right now compression always forces blob URLs. How about disabling decompression and stripping .z from each path when prefetchResources=false / --no-prefetch is given (i.e., make compression and no-preload mode mutually exclusive), and then adding something like "Disabling resource prefetching! Also, please run 'npm decompress' to provide all the uncompressed resources in case you see failing requests or missing files." to the warning in

console.warn("Disabling resource prefetching!");
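
A rough sketch of that behavior (the option and variable names here are assumptions, not the actual driver code):

if (!prefetchResources) {
    console.warn("Disabling resource prefetching! Also, please run 'npm run decompress' to provide "
        + "all the uncompressed resources in case you see failing requests or missing files.");
    // With prefetching off, point at the raw file instead of the compressed .z blob.
    url = url.replace(/\.z$/, "");
}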

danleh added a commit to danleh/JetStream that referenced this pull request Sep 1, 2025
Based on `compress.py` from WebKit#170, with some modifications:
- Can be run as `npm run compress` or simply `node compress.mjs`
- Uses best zlib compression setting.
- Takes arbitrary glob patterns for input files, defaulting to all .z files for decompression.
- Copies the file mode over to avoid spurious git diffs.
danleh added a commit to danleh/JetStream that referenced this pull request Sep 2, 2025

// If we aren't supposed to prefetch this and don't need to decompress it,
// then return code snippet that will load the url on-demand.
let compressed = isCompressed(url);

nit: const compressed = ...

return `load("${url}");`

if (this.requests.has(url)) {
return this.requests.get(url);
}

const contents = readFile(url);
let contents;
if (isCompressed(url)) {

nit: use compressed from above.

help='Decompress all .z files in current directory and subdirectories')
parser.add_argument('--keep-input', action='store_true',
help='Keep input files after processing (default: remove input files)')
parser.add_argument('--directory', default='.',

nit: for consistency with find and the like, I would have expected this to be a positional argument, i.e.,

parser.add_argument('directory', nargs='?', default='.', 
                    help='Directory to search for files (default: current directory)')


// Fallback for shell environments without TextDecoder. This only handles valid
// UTF-8, invalid buffers will lead to unexpected results.
function decodeUTF8(int8Array) {

This could use the shared polyfill in #173 instead.


Could use the node script from #172 instead (which stays in the NPM ecosystem, uses the best zlib compression ratio, and copies the file mode over).

@@ -1161,7 +1273,7 @@ class GroupedBenchmark extends Benchmark {
await benchmark.prefetchResourcesForBrowser();
}

async retryPrefetchResourcesForBrowser() {
async retryjForBrowser() {

nit: intended naming change?

eqrion (Contributor, Author) commented Sep 2, 2025

Thanks for the reviews!

I like the idea of using node for the compression script; I'll use #172 once it has merged, and also the shared polyfill for TextDecoder.

* Not sure how folks feel about `--no-prefetch` and wasm in this case (at least for JS I'd want to have the uncompressed source files there so I can easily see the source file path in the raw profile); maybe we need to warn about this and just require manually running `npm run decompress`.

Yeah that seems like a better path than just silently re-enabling prefetching for those files. I'll implement that.

    Should we keep all the original uncompressed files?

One reason for this change was to make the repository smaller on disk (excluding .git/) for vendoring JetStream, so let's not keep the uncompressed files checked in. Also in particular for Wasm files or machine learning model weights, one cannot diff them conveniently anyway, e.g., when reviewing PRs here, so I don't see much value in keeping them. Having a simple script to uncompress sounds good enough.

As long as the uncompressed files are not used by the default runner, it is fine for them to be checked in. I can exclude them when vendoring the JS3 repo into Firefox and only copy over the .z files. But it also does seem nice to only have one canonical version of things.

What might change this is if we wanted to compress JS files too (which can be diff'ed and inspected easily). From #154, there were three large JS files (excluding tfjs which is disabled) that could be good candidates for this:

12      ./web-tooling-benchmark/cli.js
12      ./web-tooling-benchmark/browser.js
12      ./RexBench/FlightPlanner/waypoints.js

How do folks feel about compressing JS too? If that's okay with folks, then we probably should keep the uncompressed versions around.

I agree that it's convenient to have a single script to decompress everything (in particular given the next point by Camillo). But I would like the build scripts to be self-contained / a single step; otherwise I think it's easy to forget or at least annoying having to run another python3 compress.py (or npm run compress) command after each build, e.g., when updating a workload. The compress command could take a list of files as input (including glob patterns), e.g., npm compress **/*.{wasm,dat} in the build script, and npm decompress could use **/*.z as the pattern by default.

That's fine with me too. I was just running out of time on Friday and wanted to have something quicker. Updating all the build scripts probably isn't too bad.

camillobruni (Contributor) commented:

Thanks for kicking this off 👍

+1 on compressing large JS files too, given that this would just work transparently with prefetching!
Some of my pending PRs do indeed have huge files.
If we add an npm run shell ... helper or similar, we could even hide the decompression transparently, so that would be fine.

danleh (Contributor) commented Sep 3, 2025

#172 landed, so feel free to use / rebase this on top of it.

Also +1 to compress large JS files.

As discussed, that could still work without keeping the original / uncompressed files in the repo. Basically the default config would do preloading and decompression during that preloading, so no uncompressed files on disk are required. And without preloading, we just rewrite the URLs/file loads to strip .z and 404 / error out if not present, thus requiring npm compress -- -d to be run beforehand (optionally integrated into a single step with npm run shell or npm run server as Camillo proposed).

Edit: Re-reading/thinking about the arguments, I am not so sure about compressing source JS files any more. (JS files that are essentially blobs, i.e., generated and never manually modified, such as the inputs for babel, are fine to compress.) Keeping uncompressed JS source files around for diff/code review/maintenance sounds like a good idea. In terms of transfer size during loading, there won't be any benefit to compressing in the repo, since a competent web server will use some compression scheme anyway. E.g., the Netlify preview uses brotli (see screenshot, ~2MB vs ~12MB uncompressed for waypoints.js).

[Screenshot: Netlify serving waypoints.js with brotli, roughly 2MB transferred vs. roughly 12MB uncompressed]

eqrion (Contributor, Author) commented Sep 4, 2025

@danleh @camillobruni

Here's an alternative idea. What if we just left all of the files in this tree uncompressed, and only added support to JetStreamDriver for decompressing? It would then be up to anyone vendoring the tree to compress whatever files they want and rewrite the paths in the driver. I can have a script that does this as part of the mozilla vendoring process.

We wouldn't need to update any build scripts, or do anything for disablePrefetching+compression (we wouldn't be doing that on the vendored copy). I could probably drop all the shell polyfilling for zlib too, because we would only be running the vendored copy in the browser. We'd also continue to get good diffs for free.
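
If we went that route, the vendoring step could be a small standalone helper; a hypothetical sketch (not part of this PR) that compresses one file with zlib and keeps only the .z copy:

import { deflateSync, constants } from "node:zlib";
import { readFileSync, writeFileSync, unlinkSync } from "node:fs";

// Replace `path` with a zlib-compressed `path.z`, as a vendoring script might do.
function compressInPlace(path) {
    const data = readFileSync(path);
    writeFileSync(path + ".z", deflateSync(data, { level: constants.Z_BEST_COMPRESSION }));
    unlinkSync(path);  // keep only the compressed artifact in the vendored tree
}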
