Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Serialized wasmtime modules are extremely large #2318

Closed
whitequark opened this issue Oct 25, 2020 · 18 comments
Closed

Serialized wasmtime modules are extremely large #2318

whitequark opened this issue Oct 25, 2020 · 18 comments

Comments

@whitequark
Copy link
Contributor

whitequark commented Oct 25, 2020

I'm using custom module serialization based caching in YoWASP, primarily to be able to show a message during cold startup (which can take many minutes on slow machines), but also to avoid applying zstd compression that noticeably increases warm startup latency.

The problem is that the cached modules are extremely large—so large that it causes a near-pathological amount of disk I/O. (On Windows VMs, due to the I/O, warm startup actually becomes slower compared to wasmtime's own caching. On bare metal Windows on SSDs it still gets faster.)

On Linux, yosys.wasm (21M wasm file) results in a 290M serialized module, an increase of ~14×. On Windows, for some reason, that very same file results in a ~580M serialized module, an increase of ~28×. The native Yosys binary for x86_64 is 17M; even once you account for libc and libstdc++, it really should not take over half a gigabyte of space.

I'm filing this issue on request of @alexcrichton here.

@whitequark
Copy link
Contributor Author

Oh and the script that runs it is here. It's extremely simple but maybe I forgot something important.

alexcrichton added a commit to alexcrichton/wasmtime that referenced this issue Oct 26, 2020
This commit compresses `FunctionAddressMap` by performing a simple
coalescing of adjacent `InstructionAddressMap` descriptors if they
describe the same source location. This is intended to handle the common
case where a sequene of machine instructions describes a high-level wasm
instruction.

For the module on bytecodealliance#2318 this reduces the cache entry size from 306MB to
161MB.
@alexcrichton
Copy link
Member

The cache entry size is effectively the bincode-coded size of this structure, which I believe is roughly analagous to the in-memory size of that structure. Adding some simple instrumentation I was able to get the encoded byte size of each of the fields:

module:                  152817        
obj:                   56236896        
funcs:                242146780        
  funcs(traps):                24182940
  funcs(address_map):         217774640
  funcs(stack_maps):             189216
data_initializers:      2880584        
unwind_info:            4611040        
total:                306028126        

clearly funcs is the massive field (too much so), and the worst offenders are the traps and address_maps maps. Neither of these maps really justifies taking up so much space, and longer-term we need to have a more serious refactoring to probably encode these entirely different in a much more compact fashion. The consumer, however, relies on random-access right now (or at least the ability to do a binary search), so it won't be immediately trivial to restructure these data structures. In the meantime #2321 should be a band-aid for now to make it a bit more reasonable (although still pretty unreasonable).

I've yet to investigate the Windows discrepancy here, that's quite worrisome too!

@alexcrichton
Copy link
Member

Oh and after that PR, the breakdown looks like:

module:                  152817        
obj:                   56236896        
funcs:                 97596780        
  funcs(traps):                24182940
  funcs(address_map):          73224640
  funcs(stack_maps):             189216
data_initializers:      2880584        
unwind_info:            4611040        
total:                161478126        

alexcrichton added a commit to alexcrichton/wasmtime that referenced this issue Oct 26, 2020
Update `Module::{serialize,deserialize}` to use variable-length integers
with `bincode` to make the output artifacts smaller. Locally this
reduces the size of bytecodealliance#2318 from 160 to 110 MB, a 30% decrease in size!
Deserialization performance is slightly slower, but seemingly within the
range of noise locally for me.
alexcrichton added a commit that referenced this issue Oct 26, 2020
Update `Module::{serialize,deserialize}` to use variable-length integers
with `bincode` to make the output artifacts smaller. Locally this
reduces the size of #2318 from 160 to 110 MB, a 30% decrease in size!
Deserialization performance is slightly slower, but seemingly within the
range of noise locally for me.
alexcrichton added a commit that referenced this issue Oct 26, 2020
This commit compresses `FunctionAddressMap` by performing a simple
coalescing of adjacent `InstructionAddressMap` descriptors if they
describe the same source location. This is intended to handle the common
case where a sequene of machine instructions describes a high-level wasm
instruction.

For the module on #2318 this reduces the cache entry size from 306MB to
161MB.
alexcrichton added a commit to alexcrichton/wasmtime that referenced this issue Oct 26, 2020
This commit reduces the size of `InstructionAddressMap` from 24 bytes to
8 bytes by dropping the `code_len` field and reducing `code_offset` to
`u32` instead of `usize`. The intention is to primarily make the
in-memory version take up less space, and the hunch is that the
`code_len` is largely not necessary since most entries in this map are
always adjacent to one another. The `code_len` field is now implied by
the `code_offset` field of the next entry in the map.

This isn't as big of an improvement to serialized module size as bytecodealliance#2321
or bytecodealliance#2322, primarily because of the switch to variable-length encoding.
Despite this though it shaves about 10MB off the encoded size of the
module from bytecodealliance#2318
alexcrichton added a commit to alexcrichton/wasmtime that referenced this issue Oct 27, 2020
This commit reduces the size of `InstructionAddressMap` from 24 bytes to
8 bytes by dropping the `code_len` field and reducing `code_offset` to
`u32` instead of `usize`. The intention is to primarily make the
in-memory version take up less space, and the hunch is that the
`code_len` is largely not necessary since most entries in this map are
always adjacent to one another. The `code_len` field is now implied by
the `code_offset` field of the next entry in the map.

This isn't as big of an improvement to serialized module size as bytecodealliance#2321
or bytecodealliance#2322, primarily because of the switch to variable-length encoding.
Despite this though it shaves about 10MB off the encoded size of the
module from bytecodealliance#2318
alexcrichton added a commit to alexcrichton/wasmtime that referenced this issue Oct 27, 2020
Added in c4e1022 I think the original
reason (which I'm not entirely knowledgeable of) may no longer be
applicable? In any case this is a significant difference on Windows from
other platforms because it makes loads/stores of wasm code have manual
checks instead of relying on the guard page, causing runtime and
compile-time slowdowns on Windows-only.

I originally rediscovered this when investigating bytecodealliance#2318 and saw that
both the compile time of the module in question and trap information
tables were much larger than they were on Linux. Removing this
Windows-specific configuration fixed the discrepancies and afterwards
Linux and Windows were basically the same.
@alexcrichton
Copy link
Member

Ok I've got one more linked improvement as well as #2326 which should account for the discrepancy on Windows.

sunfishcode pushed a commit that referenced this issue Oct 28, 2020
Added in c4e1022 I think the original
reason (which I'm not entirely knowledgeable of) may no longer be
applicable? In any case this is a significant difference on Windows from
other platforms because it makes loads/stores of wasm code have manual
checks instead of relying on the guard page, causing runtime and
compile-time slowdowns on Windows-only.

I originally rediscovered this when investigating #2318 and saw that
both the compile time of the module in question and trap information
tables were much larger than they were on Linux. Removing this
Windows-specific configuration fixed the discrepancies and afterwards
Linux and Windows were basically the same.
@alexcrichton
Copy link
Member

With current master plus #2324 the cache entry size of the original wasm module is down to 97MB, and the cache entry sizes should be roughly the same on Linux and Windows. @whitequark for your use cases is that still too large? It's still pretty large having a >4x size increase, but further changes may require more invasive tweaks than the low hanging fruit picked off so far.

@whitequark
Copy link
Contributor Author

for your use cases is that still too large?

Seems excellent to me. Any chance you can do a point release of wasmtime-py with the updated interpreter? I have to test on a fairly heterogenous set of machines so it'd be much easier if I didn't have to distribute patched wheels as well.

@alexcrichton
Copy link
Member

Sure! I'll wait until #2324 is in and then I'll do a release of wasmtime

@fitzgen
Copy link
Member

fitzgen commented Oct 30, 2020

I don't know how easy bincode is to extend, but all the address-keyed func data could probably compress really well if we used deltas instead of absolute values combined with a variable length integer encoding (e.g. LEB128).

@fitzgen
Copy link
Member

fitzgen commented Oct 30, 2020

(and we ensured they were sorted)

alexcrichton added a commit to alexcrichton/wasmtime that referenced this issue Nov 3, 2020
This commit reduces the size of `InstructionAddressMap` from 24 bytes to
8 bytes by dropping the `code_len` field and reducing `code_offset` to
`u32` instead of `usize`. The intention is to primarily make the
in-memory version take up less space, and the hunch is that the
`code_len` is largely not necessary since most entries in this map are
always adjacent to one another. The `code_len` field is now implied by
the `code_offset` field of the next entry in the map.

This isn't as big of an improvement to serialized module size as bytecodealliance#2321
or bytecodealliance#2322, primarily because of the switch to variable-length encoding.
Despite this though it shaves about 10MB off the encoded size of the
module from bytecodealliance#2318
alexcrichton added a commit that referenced this issue Nov 3, 2020
This commit reduces the size of `InstructionAddressMap` from 24 bytes to
8 bytes by dropping the `code_len` field and reducing `code_offset` to
`u32` instead of `usize`. The intention is to primarily make the
in-memory version take up less space, and the hunch is that the
`code_len` is largely not necessary since most entries in this map are
always adjacent to one another. The `code_len` field is now implied by
the `code_offset` field of the next entry in the map.

This isn't as big of an improvement to serialized module size as #2321
or #2322, primarily because of the switch to variable-length encoding.
Despite this though it shaves about 10MB off the encoded size of the
module from #2318
@alexcrichton
Copy link
Member

@fitzgen turns out bincode has an option for just that!

@alexcrichton
Copy link
Member

Ok wasmtime-py 0.21.0 is now published @whitequark if you'd be able to test it out

@whitequark
Copy link
Contributor Author

I'll take a look over the next few days!

@whitequark
Copy link
Contributor Author

Were there any breaking changes? It's a bit unfortunate that I'll have to rebuild all the packages as that makes it a bit harder to get other people to compare several different versions.

@whitequark
Copy link
Contributor Author

I've rebuilt yowasp-yosys and yowasp-nextpnr-ice40 with wasmtime 0.21.0 on my Linux machine.

  • Yosys: 291M→94M cache file size improvement, ~2x warm startup latency improvement (wow!!);
  • nextpnr-ice40: 23M→6.9M cache file size improvement, warm startup latency changes within noise threshold.

@alexcrichton
Copy link
Member

Right now our release process for wasmtime is just "major bump everything", and we're then trying to keep the external packages like wasmtime-py and wasmtime-go in sync with the main repo. We should probably allocate space to have minor version bumps of everything as well though...

Glad to hear the results as well!

@whitequark
Copy link
Contributor Author

Ah ok if it's in sync with the main repo then it might actually be pretty handy. I wasn't sure exactly how it worked.

@whitequark
Copy link
Contributor Author

@alexcrichton I've tried running yowasp-yosys and yowasp-nextpnr-ice40 on the cheapest and slowest laptop I found nearby, and it's only something like 2× slower on both cold and warm startup than my XPS 13. I have no idea how you managed to achieve that, and I'm quite impressed. As far as I'm concerned this is enough performance.

@alexcrichton
Copy link
Member

Nice! Thanks again for the report and helping to investigate!

cfallin pushed a commit that referenced this issue Nov 30, 2020
Update `Module::{serialize,deserialize}` to use variable-length integers
with `bincode` to make the output artifacts smaller. Locally this
reduces the size of #2318 from 160 to 110 MB, a 30% decrease in size!
Deserialization performance is slightly slower, but seemingly within the
range of noise locally for me.
cfallin pushed a commit that referenced this issue Nov 30, 2020
This commit compresses `FunctionAddressMap` by performing a simple
coalescing of adjacent `InstructionAddressMap` descriptors if they
describe the same source location. This is intended to handle the common
case where a sequene of machine instructions describes a high-level wasm
instruction.

For the module on #2318 this reduces the cache entry size from 306MB to
161MB.
cfallin pushed a commit that referenced this issue Nov 30, 2020
Added in c4e1022 I think the original
reason (which I'm not entirely knowledgeable of) may no longer be
applicable? In any case this is a significant difference on Windows from
other platforms because it makes loads/stores of wasm code have manual
checks instead of relying on the guard page, causing runtime and
compile-time slowdowns on Windows-only.

I originally rediscovered this when investigating #2318 and saw that
both the compile time of the module in question and trap information
tables were much larger than they were on Linux. Removing this
Windows-specific configuration fixed the discrepancies and afterwards
Linux and Windows were basically the same.
cfallin pushed a commit that referenced this issue Nov 30, 2020
This commit reduces the size of `InstructionAddressMap` from 24 bytes to
8 bytes by dropping the `code_len` field and reducing `code_offset` to
`u32` instead of `usize`. The intention is to primarily make the
in-memory version take up less space, and the hunch is that the
`code_len` is largely not necessary since most entries in this map are
always adjacent to one another. The `code_len` field is now implied by
the `code_offset` field of the next entry in the map.

This isn't as big of an improvement to serialized module size as #2321
or #2322, primarily because of the switch to variable-length encoding.
Despite this though it shaves about 10MB off the encoded size of the
module from #2318
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants