Reduce main module size #1574

Y-- · 2024-01-08T22:19:24Z

Note: This PR is more a PoC to start a conversation: it is not in a mergeable state.

Today, when building DuckDB WASM with the "loadable extension" flag, the bundles are quite big because when building with MAIN_MODULE=1, Emscripten exports all absolutely the symbols:

Bundle	Uncompressed	Brotli	GZip
duckdb-eh.js	45M	1.4M	2.0M
duckdb-eh.wasm	32M	4.1M	6.2M

This PR is an attempt to reduce these. Here are the size when built with this PR:

Bundle	Uncompressed	Brotli	GZip
duckdb-eh.js	2.7M	193K	250K
duckdb-eh.wasm	19M	2.7M	4.2M

I was able to load all the in-tree DuckDB extensions, as well as ours with this build.

Of course, this is not that easy, here are the caveats:

I had to lower optimization level from O2 to O3 cf. Issue with O3-optimized main module and side module using emval emscripten-core/emscripten#21036
the script to list symbols to export (to generate the exported_functions file) is more akin to a magic incantation than a robust solution
the list is dependent on all the extensions that might be loaded

Happy to hear your thoughts and whether it is a road worth pursuing, or if you have better ideas on how to solve this?

carlopi · 2024-01-09T10:11:42Z

Hi! Thanks a lot for bringing this up, and the exploration you did.

I think this is a bit too brittle, but if needed this can be considered as a bridge (potentially we might need a different list for each bundle, since functions might end up having different signatures, plus there are ad hoc functions like the personality function that do are missing if exceptions are not there, see https://github.com/duckdb/duckdb-wasm/actions/runs/7453872285/job/20280188337?pr=1574#step:7:5872).

The problem is partially that the functions are exported, but mainly the fact that for each exported functions it's name (as a string) it's repeated multiple times:
1 time in the wasm file, see the output of wasm2wat duckdb-eh.wasm --enable-all | grep export:

  (export "_ZNSt3__218__allocation_guardINS_9allocatorINS_20__shared_ptr_emplaceIN5arrow12NumericArrayINS3_10Date64TypeEEENS1_IS6_EEEEEEEC2B8ue170004IS7_EET_m" (func 185))
  (export "_ZNSt3__220__shared_ptr_emplaceIN5arrow12NumericArrayINS1_10Date64TypeEEENS_9allocatorIS4_EEEC2B8ue170004IJxNS_10shared_ptrINS1_6BufferEEERKSB_xxEEES6_DpOT_" (func 193))

that in my versions generates 80000+ lines.
Then each symbol (the actual string, that might look like __ZN5arrow7struct_ERKNSt3__26vectorINS0_10shared_ptrINS_5FieldEEENS0_9allocatorIS4_EEEE, is repeated 5 more times in the connected JavaScript file. (EDIT: 5 times is for functions, there are also globals in the exports, will need to check better but I would assume a somewhat similar number)

Here I think there are few possible solutions:

a proper solution would be for the JS code to avoid the duplication entirely, and just add code on the fly based on the content of the wasm file (that is available, given it's loaded by the JavaScript file itself). This would solve the JS overhead (that is the most significant one). Proper place for this to be done would be in the dlopen Emscripten implementation and seems a bit patchy to do this in post-processing. Potentially a single redefinition of the variable might be needed, but this would cut at least 75% of the overhead (possibly 100%, unsure if the variable definition needs to be there).
another solution would be having a way to remap function names (that are long strings of characters) to a shortened version, like _ZNSt3__220__shared_ptr_emplaceIN5arrow12NumericArrayINS1_10Date64TypeEEENS_9allocatorIS4_EEEC2B8ue170004IJxNS_10shared_ptrINS1_6BufferEEERKSB_xxEEES6_DpOT_ gets mapped to f12445, and everywhere there is the first string you put the second one. This would keep 1 reference in the wasm file and 4 in the js files, but those names will be significantly shorter. (to go up to 80K names you need strings of length 5 + 1 character to start and allow easier pattern matching). Average length of names is about 105 for now, so gains will be like 90%+ of the overhead.

Main problem of this remapping is that the original names still needs to be stored somewhere.
Simple solution would be storing the mapping as part of the JavaScript code (this reduces savings from 90%+ to 75% on JS side, and keeps them Wasm side). This is probably the simpler version I would like to start experimenting with.
Somehow more involved on the infrastructure side is keeping the mapping elsewhere (basically a property of duckdb), and have this mapping be used also to produce extensions, so the mapping becomes needed only at build-time (either for duckdb-wasm or for extensions).

domoritz · 2024-01-09T10:51:34Z

As you look at file sizes, please also measure gzip and brotli compressed sizes. We should track that as you make progress.

carlopi · 2024-01-09T13:42:10Z

Actually I think solution 1 (= iterating on available exports) should be doable, for now I implemented only a prototype but it seems that the whole overhead on JS files can be avoided (adding a post-processing step).

Wasm side the exports are still somehow pricey, so maybe both solution can be combined.

I have not explored changing optimization level, also that can be done independenlty.

Y-- · 2024-01-09T14:29:31Z

Thanks for the feedback!

Agree that solution 1 sounds the most promising (and the cleanest one). I actually tried to hack something quickly before the holiday break but couldn't come up with something obvious. I agree that this seems very doable indeed, happy to help on that front.

Solution 2 is basically what closure=1 is supposed to do, and while it works very well on the small examples I tried, I faced some errors with duckdb-wasm (though didn't investigate much time looking what the problem was exactly). In any case I think we should use it once we have exhausted solution 1.

And finally completely agree with @domoritz that we should also track the gzipped and brotli-compressed size. I actually almost provided the brotli size in the PR description but felt that while important, it is not the main point: these symbols are still an overhead for the browser even if bundle is compressed. I'll edit the description with them though :-)

@carlopi, let me know how I can best help you with solution 1.

carlopi · 2024-01-17T15:47:02Z

I found a solution to the size problem, I will open a PR soon (EDIT: it's here #1589), basics are that JavaScript size we don't need to expose any C++ side functions (apart for the few duckdb_web ones and some utilities listed as EXPORTED_FUNCTIONS).

I have not find a polished way to do this, butstripping them via awk from the JS side will be enough for now.

This solutions allows to reduce the size of the uncompressed duckdb-eh.js file to 2.0M, while the wasm files will not be modified (since the overhead on listing the exports will still be there).

The general problem is that Emscripten seems not to have (or at least, not one that I have found) a way to specify that there are two level of interfaces: symbols that needs to be externally available at the C++ side to other modules to be dlopened, and symbols that needs to be exposed to the JavaScript interface (for functions that needs to be there on the JS side).

I hope a solution can be implemented properly in Emscripten.

This PR is still interesting in the blue-print to select explicitly which functions are to be exported (C++ side), that allows to reduce the weight of the wasm side, so it might make sense to have both since they are independent, I will come back to this after merging the JS-side PR.

Y-- · 2024-01-29T15:10:23Z

Closing since #1589 is a much better approach.

Reduce main module size

619dca7

carlopi mentioned this pull request Jan 9, 2024

JSDelivr: Package size exceeded the configured limit of 150 MB. #1561

Closed

carlopi mentioned this pull request Jan 18, 2024

Track file sizes + set up CI to fail if npm package size > 150MB #1592

Closed

2 tasks

Y-- closed this Jan 29, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reduce main module size #1574

Reduce main module size #1574

Y-- commented Jan 8, 2024 •

edited

Loading

carlopi commented Jan 9, 2024 •

edited

Loading

domoritz commented Jan 9, 2024

carlopi commented Jan 9, 2024

Y-- commented Jan 9, 2024

carlopi commented Jan 17, 2024 •

edited

Loading

Y-- commented Jan 29, 2024

Reduce main module size #1574

Reduce main module size #1574

Conversation

Y-- commented Jan 8, 2024 • edited Loading

carlopi commented Jan 9, 2024 • edited Loading

domoritz commented Jan 9, 2024

carlopi commented Jan 9, 2024

Y-- commented Jan 9, 2024

carlopi commented Jan 17, 2024 • edited Loading

Y-- commented Jan 29, 2024

Y-- commented Jan 8, 2024 •

edited

Loading

carlopi commented Jan 9, 2024 •

edited

Loading

carlopi commented Jan 17, 2024 •

edited

Loading