-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use mmap'd *.cwasm
as a source for memory initialization images
#3787
Use mmap'd *.cwasm
as a source for memory initialization images
#3787
Conversation
Subscribe to Label Actioncc @peterhuene
This issue or pull request has been labeled: "wasmtime:api"
Thus the following users have been cc'd because of the following labels:
To subscribe or unsubscribe from this label, edit the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few initial comments; I'll read this over and digest more tomorrow. But: I'm really excited to see this!
I'm obviously curious about benchmarks but I'm sure those are coming :-) Definitely would want to see effect on both instantiation and runtime perf (we want to make sure the mmap-of-disk vs mmap-of-memfd doesn't have any ill effects on CoW performance once the file is hot in pagecache).
One possibly subtle thing to note in Module
API documentation or elsewhere: bad things will happen if the file on disk changes while the Module
exists. This is already very much a bad idea prior to this PR, because we use the mmap for other things (JIT code! wasm_data
!) too, but mapping the live Wasm heap data from the file brings the risk front-of-mind, and makes me think we should warn about it. It's not an unreasonable expectation that a "load a thing from a file" API would read the file once and then be done with it, otherwise.
A very conservative answer to the above risk would be a new Module
constructor that is something like Module::map_file_direct(File)
, and then explicitly avoid the direct-mmap for the existing API that just takes a filename -- though I'm ambivalent about that as it would mean perf benefits are hidden in a non-default place that the average user might not find either.
Given that as you say this already is a bad idea, I think the downside you point out here weighs heavy. Is there any kind of actual safety downside to this relative to the current situation? If not, then we should either make the current behavior safer by default, or continue making use of the benefits we get from this approach. (I also think that it's fine to assume continued existence of files like this, with the argument being "don't shared libraries work this way, too" or something similarly handwavy in that direction.) |
For performance numbers I actually naively assumed it wouldn't really matter here. I don't think our microbenchmarks will really show much today though since it's all just hitting the kernel's page cache. I don't know enough about mmap performance from files to know whether this is what we'd continue to see out in the wild though. What I got though for "instantiate large module" is below. In this benchmark it was purely instantiation so only setting up vma mappings and not actually doing anything with them:
I then modified the benchmark to write 0, from Rust, to every single page of memory after instantiation succeeded and I got:
Running a few times this seemed consistent, although the "load all the pages" had some variance in time to the point that I think they were roughly equivalent. The full diff for the benchmark was https://gist.github.com/alexcrichton/7e555a405f2815ec69fcac659b5d85de and the first numbers were collected without the changes to the |
In terms of safety, |
Hm something I just thought of, I know that on Linux you can't write to an executable that's currently in use in some other process, which is exactly what we want here. I found that |
In working on bytecodealliance#3787 I see now that our coverage of loading precompiled files specifically is somewhat lacking, so this adds a config option to the fuzzers where, if enabled, will round-trip all compiled modules through the filesystem to test out the mmapped-file case.
Yup, that sounds like the right path to me; the risk was already there, it's just good to warn about it now :-) Re: benchmarking -- a thought occurred to me: this is mostly improving module-load performance, so it would be interesting to benchmark module loading! Perhaps a measured inner loop of (module load, instantiate, terminate). In theory, with a module that has a very large heap (ahem SpiderMonkey), we should see load times that are O(heap size) without this PR, and O(1)-ish with this PR. The above isn't really necessary to get this in I think -- with no negative impacts to instantiation or pagefault cost during runtime, and with the RSS benefit, I'm already convinced it's a good thing; but it would be a good way to demonstrate the benefit if you want to do that. |
This sent me down a very interesting rabbithole -- apparently, until recently, But it seems that MAP_DENYWRITE is going away -- was removed from the kernel last August -- and so this same issue, if it is one, will exist for system binaries/libraries too. So just the warning and the fact that the API is |
In working on #3787 I see now that our coverage of loading precompiled files specifically is somewhat lacking, so this adds a config option to the fuzzers where, if enabled, will round-trip all compiled modules through the filesystem to test out the mmapped-file case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did a finer-grained pass just now; a few little nits but overall this looks great! Thanks again for implementing the idea -- I think it will be a significant benefit in various use-cases.
@@ -249,9 +296,12 @@ impl ModuleTranslation<'_> { | |||
let mut offset = 0; | |||
for (memory, pages) in page_contents { | |||
let mut page_offsets = Vec::with_capacity(pages.len()); | |||
for (byte_offset, page) in pages { | |||
for (page_index, page) in pages { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting, was this a bug in the original code? I see the insertion above contents.entry(page_index)
so this is correct now, but seemed to be using a page index as a byte offset previously. Or was the first tuple element used as a page index elsewhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I looked at a few other places but I think this was just a typo because everywhere else named and used this as a page index. I was a bit worried myself though and had to do a few double-takes as I updated this!
crates/environ/src/module.rs
Outdated
// The `init_memory` method of `MemoryInitialization` is used here to | ||
// do most of the validation for us, and otherwise the data chunks are | ||
// collected into the `images` array here. | ||
let mut images: PrimaryMap<MemoryIndex, Vec<u8>> = PrimaryMap::default(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
images
seems to be created up to num_defined_memories
below, but it's a PrimaryMap<MemoryIndex, _>
; could we either use DefinedMemoryIndex
or fill it up to the total memory count?
This seems a bit different than #3782 as we are type-safe wrt the index, but would just lead to an index-out-of-bounds if there is an imported memory I think...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I ended up using MemoryIndex
here instead of DefinedMemoryIndex
because the interface to init_memory
works on MemoryIndex
(as an initializer can be for any memory) and otherwise translating between the two would require extra callbacks in the InitMemory::Runtime
case.
Otherwise though none of the specialized initialization techniques work with imported memories anyway so I don't think anything is lost with using MemoryIndex
since in the cases the optimization is applicable the two are equal. I'll double-check these areas though and make sure they're all prepared to use MemoryIndex
.
crates/environ/src/module.rs
Outdated
self.data_align = Some(page_size); | ||
let mut map = PrimaryMap::with_capacity(images.len()); | ||
let mut offset = 0u32; | ||
for (defined_memory, mut image) in images { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here also the index variable defined_memory
seems to imply that images
only contains DefinedMemoryIndex
s; should be consistent with what we do above. map
contains an entry for every memory, defined or imported, so this loop is correct if images
is over all memories as well; just a naming issue I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yeah this is a historical name, good catch though and definitely needed a rename.
This commit moves function names in a module out of the `wasmtime_environ::Module` type and into separate sections stored in the final compiled artifact. Spurred on by bytecodealliance#3787 to look at module load times I noticed that a huge amount of time was spent in deserializing this map. The `spidermonkey.wasm` file, for example, has a 3MB name section which is a lot of unnecessary data to deserialize at module load time. The names of functions are now split out into their own dedicated section of the compiled artifact and metadata about them is stored in a more compact format at runtime by avoiding a `BTreeMap` and instead using a sorted array. Overall this improves deserialize times by up to 80% for modules with large name sections since the name section is no longer deserialized at load time and it's lazily paged in as names are actually referenced.
This commit has a few minor updates and some improvements to the instantiation benchmark harness: * A `once_cell::unsync::Lazy` type is now used to guard creation of modules/engines/etc. This enables running singular benchmarks to be much faster since the benchmark no longer compiles all other benchmarks that are filtered out. Unfortunately I couldn't find a way in criterion to test whether a `BenchmarkId` is filtered out or not so we rely on the runtime laziness to initialize on the first run for benchmarks that do so. * All files located in `benches/instantiation` are now loaded for benchmarking instead of a hardcoded list. This makes it a bit easier to throw files into the directory and have them benchmarked instead of having to recompile when working with new files. * Finally a module deserialization benchmark was added to measure the time it takes to deserialize a precompiled module from disk (inspired by discussion on bytecodealliance#3787) While I was at it I also upped some limits to be able to instantiate cfallin's `spidermonkey.wasm`.
This is a great idea and something I forgot! I was inspired to write this up in a benchmark from this -- #3790. My first attempt at benchmarking this showed no improvement from this PR but I was alarmed at the relatively slow deserialize time, which spawned #3789. Upon further thought though I remembered that memfd creation is lazy and consequently not affected by what I was benchmarking (purely deserialization, not deserialization plus instantiation). I then updated the existing code to more eagerly construct the memfd, and that regressed relative to this PR by ~10x, with most of the time spent in Anyway I will get to the rest of the review in a moment now... |
* Move function names out of `Module` This commit moves function names in a module out of the `wasmtime_environ::Module` type and into separate sections stored in the final compiled artifact. Spurred on by #3787 to look at module load times I noticed that a huge amount of time was spent in deserializing this map. The `spidermonkey.wasm` file, for example, has a 3MB name section which is a lot of unnecessary data to deserialize at module load time. The names of functions are now split out into their own dedicated section of the compiled artifact and metadata about them is stored in a more compact format at runtime by avoiding a `BTreeMap` and instead using a sorted array. Overall this improves deserialize times by up to 80% for modules with large name sections since the name section is no longer deserialized at load time and it's lazily paged in as names are actually referenced. * Fix a typo * Fix compiled module determinism Need to not only sort afterwards but also first to ensure the data of the name section is consistent.
This commit updates the memfd support internally to not actually use a memfd if a compiled module originally came from disk via the `wasmtime::Module::deserialize_file` API. In this situation we already have a file descriptor open and there's no need to copy a module's heap image to a new file descriptor. To facilitate a new source of `mmap` the currently-memfd-specific-logic of creating a heap image is generalized to a new form of `MemoryInitialization` which is attempted for all modules at module-compile-time. This means that the serialized artifact to disk will have the memory image in its entirety waiting for us. Furthermore the memory image is ensured to be padded and aligned carefully to the target system's page size, notably meaning that the data section in the final object file is page-aligned and the size of the data section is also page aligned. This means that when a precompiled module is mapped from disk we can reuse the underlying `File` to mmap all initial memory images. This means that the offset-within-the-memory-mapped-file can differ for memfd-vs-not, but that's just another piece of state to track in the memfd implementation. In the limit this waters down the term "memfd" for this technique of quickly initializing memory because we no longer use memfd unconditionally (only when the backing file isn't available). This does however open up an avenue in the future to porting this support to other OSes because while `memfd_create` is Linux-specific both macOS and Windows support mapping a file with copy-on-write. This porting isn't done in this PR and is left for a future refactoring. Closes bytecodealliance#3758
Cordon off the Linux-specific bits and enable the memfd support to compile and run on platforms like macOS which have a Linux-like `mmap`. This only works if a module is mapped from a precompiled module file on disk, but that's better than not supporting it at all!
No need to have conditional alignment since their sizes are all aligned anyway
These functions all work with memory indexes, not specifically defined memory indexes.
55a34e8
to
6b5bd2a
Compare
This is great! And yeah, I agree, given that the memfd state is now cheap to create on load (a wrapper around the |
This commit has a few minor updates and some improvements to the instantiation benchmark harness: * A `once_cell::unsync::Lazy` type is now used to guard creation of modules/engines/etc. This enables running singular benchmarks to be much faster since the benchmark no longer compiles all other benchmarks that are filtered out. Unfortunately I couldn't find a way in criterion to test whether a `BenchmarkId` is filtered out or not so we rely on the runtime laziness to initialize on the first run for benchmarks that do so. * All files located in `benches/instantiation` are now loaded for benchmarking instead of a hardcoded list. This makes it a bit easier to throw files into the directory and have them benchmarked instead of having to recompile when working with new files. * Finally a module deserialization benchmark was added to measure the time it takes to deserialize a precompiled module from disk (inspired by discussion on #3787) While I was at it I also upped some limits to be able to instantiate cfallin's `spidermonkey.wasm`.
In working on bytecodealliance#3787 I see now that our coverage of loading precompiled files specifically is somewhat lacking, so this adds a config option to the fuzzers where, if enabled, will round-trip all compiled modules through the filesystem to test out the mmapped-file case.
* Move function names out of `Module` This commit moves function names in a module out of the `wasmtime_environ::Module` type and into separate sections stored in the final compiled artifact. Spurred on by bytecodealliance#3787 to look at module load times I noticed that a huge amount of time was spent in deserializing this map. The `spidermonkey.wasm` file, for example, has a 3MB name section which is a lot of unnecessary data to deserialize at module load time. The names of functions are now split out into their own dedicated section of the compiled artifact and metadata about them is stored in a more compact format at runtime by avoiding a `BTreeMap` and instead using a sorted array. Overall this improves deserialize times by up to 80% for modules with large name sections since the name section is no longer deserialized at load time and it's lazily paged in as names are actually referenced. * Fix a typo * Fix compiled module determinism Need to not only sort afterwards but also first to ensure the data of the name section is consistent.
This commit has a few minor updates and some improvements to the instantiation benchmark harness: * A `once_cell::unsync::Lazy` type is now used to guard creation of modules/engines/etc. This enables running singular benchmarks to be much faster since the benchmark no longer compiles all other benchmarks that are filtered out. Unfortunately I couldn't find a way in criterion to test whether a `BenchmarkId` is filtered out or not so we rely on the runtime laziness to initialize on the first run for benchmarks that do so. * All files located in `benches/instantiation` are now loaded for benchmarking instead of a hardcoded list. This makes it a bit easier to throw files into the directory and have them benchmarked instead of having to recompile when working with new files. * Finally a module deserialization benchmark was added to measure the time it takes to deserialize a precompiled module from disk (inspired by discussion on bytecodealliance#3787) While I was at it I also upped some limits to be able to instantiate cfallin's `spidermonkey.wasm`.
…tecodealliance#3787) * Skip memfd creation with precompiled modules This commit updates the memfd support internally to not actually use a memfd if a compiled module originally came from disk via the `wasmtime::Module::deserialize_file` API. In this situation we already have a file descriptor open and there's no need to copy a module's heap image to a new file descriptor. To facilitate a new source of `mmap` the currently-memfd-specific-logic of creating a heap image is generalized to a new form of `MemoryInitialization` which is attempted for all modules at module-compile-time. This means that the serialized artifact to disk will have the memory image in its entirety waiting for us. Furthermore the memory image is ensured to be padded and aligned carefully to the target system's page size, notably meaning that the data section in the final object file is page-aligned and the size of the data section is also page aligned. This means that when a precompiled module is mapped from disk we can reuse the underlying `File` to mmap all initial memory images. This means that the offset-within-the-memory-mapped-file can differ for memfd-vs-not, but that's just another piece of state to track in the memfd implementation. In the limit this waters down the term "memfd" for this technique of quickly initializing memory because we no longer use memfd unconditionally (only when the backing file isn't available). This does however open up an avenue in the future to porting this support to other OSes because while `memfd_create` is Linux-specific both macOS and Windows support mapping a file with copy-on-write. This porting isn't done in this PR and is left for a future refactoring. Closes bytecodealliance#3758 * Enable "memfd" support on all unix systems Cordon off the Linux-specific bits and enable the memfd support to compile and run on platforms like macOS which have a Linux-like `mmap`. This only works if a module is mapped from a precompiled module file on disk, but that's better than not supporting it at all! * Fix linux compile * Use `Arc<File>` instead of `MmapVecFileBacking` * Use a named struct instead of mysterious tuples * Comment about unsafety in `Module::deserialize_file` * Fix tests * Fix uffd compile * Always align data segments No need to have conditional alignment since their sizes are all aligned anyway * Update comment in build.rs * Use rustix, not `region` * Fix some confusing logic/names around memory indexes These functions all work with memory indexes, not specifically defined memory indexes.
This commit addresses #3758 and makes it possible to avoid
memfd_create
when loading a module from a precompiled binary stored on disk. In this situation we already mmap the file from disk, and we can use the same technique that memfd uses where a copy-on-write mapping is made whenever a module is instantiated. This measn that all Unix platforms, not only Linux, can benefit from copy-on-write so long as the module comes from a precompiled module on disk.The first commit here is refactoring to enable this functionality on Linux. After the first commit we avoid creation of a
memfd
and instead map the raw underlying*.cwasm
into memories. This involved moving the creation of the memory image to compile-time ofModule
rather than construction-time ofModule
, as well as aligning the data section to ensure it shows up at a page-offset boundary in the file (which is required bymmap
). The second commit then enables this support to work on macOS which involved some#[cfg]
followed by tweaking themadvise
logic to instead blow away the mapping (no reuse on systems withoutmadvise(DONTNEED)
as there's no portable way to reset the CoW overlay)I tried for a bit to get this working on Windows, but while I could get things to compile I don't believe the same technique we're using here for Unix works on Windows. Windows appears to reject mapping a file onto a pre-existing region allocated with
VirtualAlloc
, meaning that all attemps to map the file into memory have failed so far for me. This StackOverflow question seems to suggest that this may simply not work on Windows unless we use undocumented APIs. In any case the major benefit of this PR is avoiding extra file descriptors on Unix for modules created from files on disk, so while having Windows support would be nice it's not necessarily required.