Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use mmap'd *.cwasm as a source for memory initialization images #3787

Merged
merged 12 commits into from
Feb 10, 2022

Conversation

alexcrichton
Copy link
Member

This commit addresses #3758 and makes it possible to avoid memfd_create when loading a module from a precompiled binary stored on disk. In this situation we already mmap the file from disk, and we can use the same technique that memfd uses where a copy-on-write mapping is made whenever a module is instantiated. This measn that all Unix platforms, not only Linux, can benefit from copy-on-write so long as the module comes from a precompiled module on disk.

The first commit here is refactoring to enable this functionality on Linux. After the first commit we avoid creation of a memfd and instead map the raw underlying *.cwasm into memories. This involved moving the creation of the memory image to compile-time of Module rather than construction-time of Module, as well as aligning the data section to ensure it shows up at a page-offset boundary in the file (which is required by mmap). The second commit then enables this support to work on macOS which involved some #[cfg] followed by tweaking the madvise logic to instead blow away the mapping (no reuse on systems without madvise(DONTNEED) as there's no portable way to reset the CoW overlay)

I tried for a bit to get this working on Windows, but while I could get things to compile I don't believe the same technique we're using here for Unix works on Windows. Windows appears to reject mapping a file onto a pre-existing region allocated with VirtualAlloc, meaning that all attemps to map the file into memory have failed so far for me. This StackOverflow question seems to suggest that this may simply not work on Windows unless we use undocumented APIs. In any case the major benefit of this PR is avoiding extra file descriptors on Unix for modules created from files on disk, so while having Windows support would be nice it's not necessarily required.

@github-actions github-actions bot added the wasmtime:api Related to the API of the `wasmtime` crate itself label Feb 10, 2022
@github-actions
Copy link

Subscribe to Label Action

cc @peterhuene

This issue or pull request has been labeled: "wasmtime:api"

Thus the following users have been cc'd because of the following labels:

  • peterhuene: wasmtime:api

To subscribe or unsubscribe from this label, edit the .github/subscribe-to-label.json configuration file.

Learn more.

Copy link
Member

@cfallin cfallin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few initial comments; I'll read this over and digest more tomorrow. But: I'm really excited to see this!

I'm obviously curious about benchmarks but I'm sure those are coming :-) Definitely would want to see effect on both instantiation and runtime perf (we want to make sure the mmap-of-disk vs mmap-of-memfd doesn't have any ill effects on CoW performance once the file is hot in pagecache).

One possibly subtle thing to note in Module API documentation or elsewhere: bad things will happen if the file on disk changes while the Module exists. This is already very much a bad idea prior to this PR, because we use the mmap for other things (JIT code! wasm_data!) too, but mapping the live Wasm heap data from the file brings the risk front-of-mind, and makes me think we should warn about it. It's not an unreasonable expectation that a "load a thing from a file" API would read the file once and then be done with it, otherwise.

A very conservative answer to the above risk would be a new Module constructor that is something like Module::map_file_direct(File), and then explicitly avoid the direct-mmap for the existing API that just takes a filename -- though I'm ambivalent about that as it would mean perf benefits are hidden in a non-default place that the average user might not find either.

crates/environ/src/module.rs Outdated Show resolved Hide resolved
crates/runtime/src/memfd.rs Outdated Show resolved Hide resolved
crates/runtime/src/mmap_vec.rs Outdated Show resolved Hide resolved
@tschneidereit
Copy link
Member

A very conservative answer to the above risk would be a new Module constructor that is something like Module::map_file_direct(File), and then explicitly avoid the direct-mmap for the existing API that just takes a filename -- though I'm ambivalent about that as it would mean perf benefits are hidden in a non-default place that the average user might not find either.

Given that as you say this already is a bad idea, I think the downside you point out here weighs heavy. Is there any kind of actual safety downside to this relative to the current situation? If not, then we should either make the current behavior safer by default, or continue making use of the benefits we get from this approach. (I also think that it's fine to assume continued existence of files like this, with the argument being "don't shared libraries work this way, too" or something similarly handwavy in that direction.)

@alexcrichton
Copy link
Member Author

For performance numbers I actually naively assumed it wouldn't really matter here. I don't think our microbenchmarks will really show much today though since it's all just hitting the kernel's page cache. I don't know enough about mmap performance from files to know whether this is what we'd continue to see out in the wild though.

What I got though for "instantiate large module" is below. In this benchmark it was purely instantiation so only setting up vma mappings and not actually doing anything with them:

parallel/pooling/rustpython.wasm: with 1 thread
                        time:   [5.1166 us 5.1196 us 5.1226 us]

parallel/pooling/rustpython.wasm: with 1 thread (loaded from *.cwasm)
                        time:   [5.0117 us 5.0143 us 5.0170 us]

I then modified the benchmark to write 0, from Rust, to every single page of memory after instantiation succeeded and I got:

parallel/pooling/rustpython.wasm: with 1 thread
                        time:   [4.0414 ms 4.0418 ms 4.0422 ms]

parallel/pooling/rustpython.wasm: with 1 thread (loaded from *.cwasm)
                        time:   [4.0682 ms 4.0686 ms 4.0690 ms]

Running a few times this seemed consistent, although the "load all the pages" had some variance in time to the point that I think they were roughly equivalent.

The full diff for the benchmark was https://gist.github.com/alexcrichton/7e555a405f2815ec69fcac659b5d85de and the first numbers were collected without the changes to the instantiate function which write to memory.

@alexcrichton
Copy link
Member Author

In terms of safety, Module::deserialize_file is already unsafe because loading arbitrary data is not memory safe, so adding to the list of requirements "this needs to stay as-expected for the entire duration of the lifetime of the Module doesn't seem so bad to add. I don't think it makes the safety any worse relative to what we have today?

@alexcrichton
Copy link
Member Author

Hm something I just thought of, I know that on Linux you can't write to an executable that's currently in use in some other process, which is exactly what we want here. I found that MAP_DENYWRITE seems like it would do this as an option to mmap but the man page says that it's an ignored flag nowadays due to denial-of-service attacks (I guess it lets you just exclusively lock files for yourself against everyone else's will). Do others know how we could get this behavior, though? That might help the prevent-future-writes problem a bit.

alexcrichton added a commit to alexcrichton/wasmtime that referenced this pull request Feb 10, 2022
In working on bytecodealliance#3787 I see now that our coverage of loading precompiled
files specifically is somewhat lacking, so this adds a config option to
the fuzzers where, if enabled, will round-trip all compiled modules
through the filesystem to test out the mmapped-file case.
@cfallin
Copy link
Member

cfallin commented Feb 10, 2022

In terms of safety, Module::deserialize_file is already unsafe because loading arbitrary data is not memory safe, so adding to the list of requirements "this needs to stay as-expected for the entire duration of the lifetime of the Module doesn't seem so bad to add. I don't think it makes the safety any worse relative to what we have today?

Yup, that sounds like the right path to me; the risk was already there, it's just good to warn about it now :-)

Re: benchmarking -- a thought occurred to me: this is mostly improving module-load performance, so it would be interesting to benchmark module loading! Perhaps a measured inner loop of (module load, instantiate, terminate). In theory, with a module that has a very large heap (ahem SpiderMonkey), we should see load times that are O(heap size) without this PR, and O(1)-ish with this PR.

The above isn't really necessary to get this in I think -- with no negative impacts to instantiation or pagefault cost during runtime, and with the RSS benefit, I'm already convinced it's a good thing; but it would be a good way to demonstrate the benefit if you want to do that.

@cfallin
Copy link
Member

cfallin commented Feb 10, 2022

Hm something I just thought of, I know that on Linux you can't write to an executable that's currently in use in some other process, which is exactly what we want here. I found that MAP_DENYWRITE seems like it would do this as an option to mmap but the man page says that it's an ignored flag nowadays due to denial-of-service attacks (I guess it lets you just exclusively lock files for yourself against everyone else's will). Do others know how we could get this behavior, though? That might help the prevent-future-writes problem a bit.

This sent me down a very interesting rabbithole -- apparently, until recently, MAP_DENYWRITE did work. Indeed if I strace /bin/ls on my system I see the flag included in the mmaps of system libraries. A quick test shows that while one binary executes on my system, I get ETXTBSY when trying to open the file O_RDWR from another process. (Superficially one can "write" it by copying over it or editing with a text editor, but this really just replaces the directory entry, and doesn't open the existing file for writing.)

But it seems that MAP_DENYWRITE is going away -- was removed from the kernel last August -- and so this same issue, if it is one, will exist for system binaries/libraries too.

So just the warning and the fact that the API is unsafe seems enough to me at least!

alexcrichton added a commit that referenced this pull request Feb 10, 2022
In working on #3787 I see now that our coverage of loading precompiled
files specifically is somewhat lacking, so this adds a config option to
the fuzzers where, if enabled, will round-trip all compiled modules
through the filesystem to test out the mmapped-file case.
Copy link
Member

@cfallin cfallin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did a finer-grained pass just now; a few little nits but overall this looks great! Thanks again for implementing the idea -- I think it will be a significant benefit in various use-cases.

@@ -249,9 +296,12 @@ impl ModuleTranslation<'_> {
let mut offset = 0;
for (memory, pages) in page_contents {
let mut page_offsets = Vec::with_capacity(pages.len());
for (byte_offset, page) in pages {
for (page_index, page) in pages {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting, was this a bug in the original code? I see the insertion above contents.entry(page_index) so this is correct now, but seemed to be using a page index as a byte offset previously. Or was the first tuple element used as a page index elsewhere?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked at a few other places but I think this was just a typo because everywhere else named and used this as a page index. I was a bit worried myself though and had to do a few double-takes as I updated this!

// The `init_memory` method of `MemoryInitialization` is used here to
// do most of the validation for us, and otherwise the data chunks are
// collected into the `images` array here.
let mut images: PrimaryMap<MemoryIndex, Vec<u8>> = PrimaryMap::default();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

images seems to be created up to num_defined_memories below, but it's a PrimaryMap<MemoryIndex, _>; could we either use DefinedMemoryIndex or fill it up to the total memory count?

This seems a bit different than #3782 as we are type-safe wrt the index, but would just lead to an index-out-of-bounds if there is an imported memory I think...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ended up using MemoryIndex here instead of DefinedMemoryIndex because the interface to init_memory works on MemoryIndex (as an initializer can be for any memory) and otherwise translating between the two would require extra callbacks in the InitMemory::Runtime case.

Otherwise though none of the specialized initialization techniques work with imported memories anyway so I don't think anything is lost with using MemoryIndex since in the cases the optimization is applicable the two are equal. I'll double-check these areas though and make sure they're all prepared to use MemoryIndex.

self.data_align = Some(page_size);
let mut map = PrimaryMap::with_capacity(images.len());
let mut offset = 0u32;
for (defined_memory, mut image) in images {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here also the index variable defined_memory seems to imply that images only contains DefinedMemoryIndexs; should be consistent with what we do above. map contains an entry for every memory, defined or imported, so this loop is correct if images is over all memories as well; just a naming issue I think.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yeah this is a historical name, good catch though and definitely needed a rename.

crates/jit/src/instantiate.rs Outdated Show resolved Hide resolved
crates/runtime/build.rs Outdated Show resolved Hide resolved
crates/runtime/src/memfd.rs Outdated Show resolved Hide resolved
alexcrichton added a commit to alexcrichton/wasmtime that referenced this pull request Feb 10, 2022
This commit moves function names in a module out of the
`wasmtime_environ::Module` type and into separate sections stored in the
final compiled artifact. Spurred on by bytecodealliance#3787 to look at module load
times I noticed that a huge amount of time was spent in deserializing
this map. The `spidermonkey.wasm` file, for example, has a 3MB name
section which is a lot of unnecessary data to deserialize at module load
time.

The names of functions are now split out into their own dedicated
section of the compiled artifact and metadata about them is stored in a
more compact format at runtime by avoiding a `BTreeMap` and instead
using a sorted array. Overall this improves deserialize times by up to
80% for modules with large name sections since the name section is no
longer deserialized at load time and it's lazily paged in as names are
actually referenced.
alexcrichton added a commit to alexcrichton/wasmtime that referenced this pull request Feb 10, 2022
This commit has a few minor updates and some improvements to the
instantiation benchmark harness:

* A `once_cell::unsync::Lazy` type is now used to guard creation of
  modules/engines/etc. This enables running singular benchmarks to be
  much faster since the benchmark no longer compiles all other
  benchmarks that are filtered out. Unfortunately I couldn't find a way
  in criterion to test whether a `BenchmarkId` is filtered out or not so
  we rely on the runtime laziness to initialize on the first run for
  benchmarks that do so.

* All files located in `benches/instantiation` are now loaded for
  benchmarking instead of a hardcoded list. This makes it a bit easier
  to throw files into the directory and have them benchmarked instead of
  having to recompile when working with new files.

* Finally a module deserialization benchmark was added to measure the
  time it takes to deserialize a precompiled module from disk (inspired
  by discussion on bytecodealliance#3787)

While I was at it I also upped some limits to be able to instantiate
cfallin's `spidermonkey.wasm`.
@alexcrichton
Copy link
Member Author

a thought occurred to me: this is mostly improving module-load performance, so it would be interesting to benchmark module loading

This is a great idea and something I forgot! I was inspired to write this up in a benchmark from this -- #3790. My first attempt at benchmarking this showed no improvement from this PR but I was alarmed at the relatively slow deserialize time, which spawned #3789. Upon further thought though I remembered that memfd creation is lazy and consequently not affected by what I was benchmarking (purely deserialization, not deserialization plus instantiation).

I then updated the existing code to more eagerly construct the memfd, and that regressed relative to this PR by ~10x, with most of the time spent in write and memcpy (movment from data segments to the image). So it looks like this can definitely help improve first-instantiate time and we can probably make memfd creation un-lazy after this.


Anyway I will get to the rest of the review in a moment now...

alexcrichton added a commit that referenced this pull request Feb 10, 2022
* Move function names out of `Module`

This commit moves function names in a module out of the
`wasmtime_environ::Module` type and into separate sections stored in the
final compiled artifact. Spurred on by #3787 to look at module load
times I noticed that a huge amount of time was spent in deserializing
this map. The `spidermonkey.wasm` file, for example, has a 3MB name
section which is a lot of unnecessary data to deserialize at module load
time.

The names of functions are now split out into their own dedicated
section of the compiled artifact and metadata about them is stored in a
more compact format at runtime by avoiding a `BTreeMap` and instead
using a sorted array. Overall this improves deserialize times by up to
80% for modules with large name sections since the name section is no
longer deserialized at load time and it's lazily paged in as names are
actually referenced.

* Fix a typo

* Fix compiled module determinism

Need to not only sort afterwards but also first to ensure the data of
the name section is consistent.
This commit updates the memfd support internally to not actually use a
memfd if a compiled module originally came from disk via the
`wasmtime::Module::deserialize_file` API. In this situation we already
have a file descriptor open and there's no need to copy a module's heap
image to a new file descriptor.

To facilitate a new source of `mmap` the currently-memfd-specific-logic
of creating a heap image is generalized to a new form of
`MemoryInitialization` which is attempted for all modules at
module-compile-time. This means that the serialized artifact to disk
will have the memory image in its entirety waiting for us. Furthermore
the memory image is ensured to be padded and aligned carefully to the
target system's page size, notably meaning that the data section in the
final object file is page-aligned and the size of the data section is
also page aligned.

This means that when a precompiled module is mapped from disk we can
reuse the underlying `File` to mmap all initial memory images. This
means that the offset-within-the-memory-mapped-file can differ for
memfd-vs-not, but that's just another piece of state to track in the
memfd implementation.

In the limit this waters down the term "memfd" for this technique of
quickly initializing memory because we no longer use memfd
unconditionally (only when the backing file isn't available).
This does however open up an avenue in the future to porting this
support to other OSes because while `memfd_create` is Linux-specific
both macOS and Windows support mapping a file with copy-on-write. This
porting isn't done in this PR and is left for a future refactoring.

Closes bytecodealliance#3758
Cordon off the Linux-specific bits and enable the memfd support to
compile and run on platforms like macOS which have a Linux-like `mmap`.
This only works if a module is mapped from a precompiled module file on
disk, but that's better than not supporting it at all!
No need to have conditional alignment since their sizes are all aligned
anyway
These functions all work with memory indexes, not specifically defined
memory indexes.
@cfallin
Copy link
Member

cfallin commented Feb 10, 2022

I then updated the existing code to more eagerly construct the memfd, and that regressed relative to this PR by ~10x, with most of the time spent in write and memcpy (movment from data segments to the image). So it looks like this can definitely help improve first-instantiate time and we can probably make memfd creation un-lazy after this.

This is great! And yeah, I agree, given that the memfd state is now cheap to create on load (a wrapper around the Arc<File> with some metadata basically) there's no reason for it to be lazy anymore.

alexcrichton added a commit that referenced this pull request Feb 10, 2022
This commit has a few minor updates and some improvements to the
instantiation benchmark harness:

* A `once_cell::unsync::Lazy` type is now used to guard creation of
  modules/engines/etc. This enables running singular benchmarks to be
  much faster since the benchmark no longer compiles all other
  benchmarks that are filtered out. Unfortunately I couldn't find a way
  in criterion to test whether a `BenchmarkId` is filtered out or not so
  we rely on the runtime laziness to initialize on the first run for
  benchmarks that do so.

* All files located in `benches/instantiation` are now loaded for
  benchmarking instead of a hardcoded list. This makes it a bit easier
  to throw files into the directory and have them benchmarked instead of
  having to recompile when working with new files.

* Finally a module deserialization benchmark was added to measure the
  time it takes to deserialize a precompiled module from disk (inspired
  by discussion on #3787)

While I was at it I also upped some limits to be able to instantiate
cfallin's `spidermonkey.wasm`.
@alexcrichton alexcrichton merged commit c0c368d into bytecodealliance:main Feb 10, 2022
@alexcrichton alexcrichton deleted the reuse-cwasm-mmap branch February 10, 2022 21:40
mpardesh pushed a commit to avanhatt/wasmtime that referenced this pull request Mar 17, 2022
In working on bytecodealliance#3787 I see now that our coverage of loading precompiled
files specifically is somewhat lacking, so this adds a config option to
the fuzzers where, if enabled, will round-trip all compiled modules
through the filesystem to test out the mmapped-file case.
mpardesh pushed a commit to avanhatt/wasmtime that referenced this pull request Mar 17, 2022
* Move function names out of `Module`

This commit moves function names in a module out of the
`wasmtime_environ::Module` type and into separate sections stored in the
final compiled artifact. Spurred on by bytecodealliance#3787 to look at module load
times I noticed that a huge amount of time was spent in deserializing
this map. The `spidermonkey.wasm` file, for example, has a 3MB name
section which is a lot of unnecessary data to deserialize at module load
time.

The names of functions are now split out into their own dedicated
section of the compiled artifact and metadata about them is stored in a
more compact format at runtime by avoiding a `BTreeMap` and instead
using a sorted array. Overall this improves deserialize times by up to
80% for modules with large name sections since the name section is no
longer deserialized at load time and it's lazily paged in as names are
actually referenced.

* Fix a typo

* Fix compiled module determinism

Need to not only sort afterwards but also first to ensure the data of
the name section is consistent.
mpardesh pushed a commit to avanhatt/wasmtime that referenced this pull request Mar 17, 2022
This commit has a few minor updates and some improvements to the
instantiation benchmark harness:

* A `once_cell::unsync::Lazy` type is now used to guard creation of
  modules/engines/etc. This enables running singular benchmarks to be
  much faster since the benchmark no longer compiles all other
  benchmarks that are filtered out. Unfortunately I couldn't find a way
  in criterion to test whether a `BenchmarkId` is filtered out or not so
  we rely on the runtime laziness to initialize on the first run for
  benchmarks that do so.

* All files located in `benches/instantiation` are now loaded for
  benchmarking instead of a hardcoded list. This makes it a bit easier
  to throw files into the directory and have them benchmarked instead of
  having to recompile when working with new files.

* Finally a module deserialization benchmark was added to measure the
  time it takes to deserialize a precompiled module from disk (inspired
  by discussion on bytecodealliance#3787)

While I was at it I also upped some limits to be able to instantiate
cfallin's `spidermonkey.wasm`.
mpardesh pushed a commit to avanhatt/wasmtime that referenced this pull request Mar 17, 2022
…tecodealliance#3787)

* Skip memfd creation with precompiled modules

This commit updates the memfd support internally to not actually use a
memfd if a compiled module originally came from disk via the
`wasmtime::Module::deserialize_file` API. In this situation we already
have a file descriptor open and there's no need to copy a module's heap
image to a new file descriptor.

To facilitate a new source of `mmap` the currently-memfd-specific-logic
of creating a heap image is generalized to a new form of
`MemoryInitialization` which is attempted for all modules at
module-compile-time. This means that the serialized artifact to disk
will have the memory image in its entirety waiting for us. Furthermore
the memory image is ensured to be padded and aligned carefully to the
target system's page size, notably meaning that the data section in the
final object file is page-aligned and the size of the data section is
also page aligned.

This means that when a precompiled module is mapped from disk we can
reuse the underlying `File` to mmap all initial memory images. This
means that the offset-within-the-memory-mapped-file can differ for
memfd-vs-not, but that's just another piece of state to track in the
memfd implementation.

In the limit this waters down the term "memfd" for this technique of
quickly initializing memory because we no longer use memfd
unconditionally (only when the backing file isn't available).
This does however open up an avenue in the future to porting this
support to other OSes because while `memfd_create` is Linux-specific
both macOS and Windows support mapping a file with copy-on-write. This
porting isn't done in this PR and is left for a future refactoring.

Closes bytecodealliance#3758

* Enable "memfd" support on all unix systems

Cordon off the Linux-specific bits and enable the memfd support to
compile and run on platforms like macOS which have a Linux-like `mmap`.
This only works if a module is mapped from a precompiled module file on
disk, but that's better than not supporting it at all!

* Fix linux compile

* Use `Arc<File>` instead of `MmapVecFileBacking`

* Use a named struct instead of mysterious tuples

* Comment about unsafety in `Module::deserialize_file`

* Fix tests

* Fix uffd compile

* Always align data segments

No need to have conditional alignment since their sizes are all aligned
anyway

* Update comment in build.rs

* Use rustix, not `region`

* Fix some confusing logic/names around memory indexes

These functions all work with memory indexes, not specifically defined
memory indexes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
wasmtime:api Related to the API of the `wasmtime` crate itself
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants