Hugepages support for Firecracker #4360
Merged
17 commits
- 2997bf8 Add `huge_pages` field to `machine-config` endpoint (roypat)
- 8bcab5b chore: Mark huge pages support as developer preview (roypat)
- fb3be86 Wire up huge_page api parameter with memory allocation code (roypat)
- ee71638 Wire up huge pages support with snapshot feature (roypat)
- da62f7d Gracefully fail if hugetlbfs is attempted to be used on <4.16 host (roypat)
- 24dc497 fix(tests): Add huge_pages parameter to relevant tests (roypat)
- 807e1c7 test: Add test that booting with hugetlbfs memory works (roypat)
- 3d4c674 Generalize uffd handler to allow faulting in huge pages (roypat)
- 0c99b9f test: Add snapshot restore test for hugetlbfs backed guest (roypat)
- 2a391e1 Disallow simultaneous usage of balloon device and huge pages (roypat)
- fbcc739 Disallow simultaneous usage of initrd and huge pages (roypat)
- ac687a0 test: Add metric tracking the number of EPT_VIOLATIONS post restore (roypat)
- 69ba77f test: differential snapshots and hugepages works (roypat)
- f789bd5 docs: Add documentation for hugepages feature (roypat)
- 27eeefd docs: Update swagger.yml with huge_pages field (roypat)
- f68c09a docs: Update CHANGELOG.md (roypat)
- e179b9f Merge branch 'main' into hugepages (roypat)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,55 @@ | ||
| # Backing Guest Memory by Huge Pages | ||
|
|
||
| > \[!WARNING\] | ||
| > | ||
| > Support is currently in **developer preview**. See | ||
| > [this section](RELEASE_POLICY.md#developer-preview-features) for more info. | ||
|
|
||
| Firecracker supports backing the guest memory of a VM by 2MB hugetlbfs pages. | ||
|
||
| This can be enabled by setting the `huge_pages` field of `PUT` or `PATCH` | ||
| requests to the `/machine-config` endpoint to `2M`. | ||
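
As a rough illustration of the request described above, the following sketch builds a JSON body for the `/machine-config` endpoint with `huge_pages` set to `2M`. The helper name `machine_config_body` and the example values for `vcpu_count` and `mem_size_mib` are assumptions made for illustration; only the `huge_pages` field and its `2M` value are taken from this document.

```python
import json


def machine_config_body(vcpu_count: int, mem_size_mib: int, huge_pages: str = "2M") -> str:
    """Build a JSON request body for /machine-config (hypothetical helper)."""
    body = {
        "vcpu_count": vcpu_count,      # example value, not prescribed by this document
        "mem_size_mib": mem_size_mib,  # example value, not prescribed by this document
        "huge_pages": huge_pages,      # "2M" enables 2MB hugetlbfs backing
    }
    return json.dumps(body)


print(machine_config_body(2, 1024))
```

The resulting string would be used as the body of a `PUT` (or `PATCH`) request to `/machine-config`; how the request is transported to Firecracker is left out of this sketch.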
|
|
||
| Backing guest memory by huge pages can bring performance improvements for | ||
| specific workloads, due to less TLB contention and less overhead during | ||
| virtual-to-physical address resolution. It can also help reduce the number of | ||
| KVM exits required to rebuild extended page tables after a snapshot restore, as | ||
| well as improve boot times (by up to 50% as measured by Firecracker's | ||
| [boot time performance tests](../tests/integration_tests/performance/test_boottime.py)). | ||
|
|
||
| Using hugetlbfs requires the host running Firecracker to have a pre-allocated | ||
| pool of 2M pages. Should this pool be too small, Firecracker may behave | ||
| erratically or receive the `SIGBUS` signal. This is because Firecracker uses the | ||
| `MAP_NORESERVE` flag when mapping guest memory. This flag means the kernel will | ||
| not try to reserve sufficient hugetlbfs pages at the time of the `mmap` call, | ||
| and will instead claim them from the pool on demand. For details on how to manage this | ||
| pool, please refer to the [Linux Documentation][hugetlbfs_docs]. | ||
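
Because pages are claimed on demand rather than reserved up front, it can help to check the pool before starting a guest. The sketch below is one possible pre-flight check, not part of Firecracker: it computes how many 2M pages a guest of a given size needs and compares that with the `HugePages_Free` value from `/proc/meminfo`. It assumes the host's default huge page size is 2M, and the function names and sample text are made up for illustration.

```python
import math
import re

HUGE_PAGE_MIB = 2  # hugetlbfs page size used for guest memory, per the text above


def pages_needed(mem_size_mib: int) -> int:
    """Number of 2M pages required to back a guest of the given size."""
    return math.ceil(mem_size_mib / HUGE_PAGE_MIB)


def free_huge_pages(meminfo_text: str) -> int:
    """Extract HugePages_Free from text in /proc/meminfo format."""
    match = re.search(r"^HugePages_Free:\s+(\d+)", meminfo_text, re.MULTILINE)
    if match is None:
        raise ValueError("HugePages_Free not found")
    return int(match.group(1))


def pool_is_sufficient(meminfo_text: str, mem_size_mib: int) -> bool:
    """Rough check that enough free huge pages exist for the guest."""
    return free_huge_pages(meminfo_text) >= pages_needed(mem_size_mib)


# Illustrative sample; on a real host, read the text from /proc/meminfo instead.
SAMPLE = "HugePages_Total:     512\nHugePages_Free:      300\nHugepagesize:       2048 kB\n"
print(pool_is_sufficient(SAMPLE, 512))   # 256 pages needed, 300 free -> True
print(pool_is_sufficient(SAMPLE, 1024))  # 512 pages needed, 300 free -> False
```

This is only a snapshot of the pool at one moment; other processes may claim pages between the check and the guest's first accesses.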
|
|
||
| ## Huge Pages and Snapshotting | ||
|
|
||
| Restoring a Firecracker snapshot of a microVM backed by huge pages will also use | ||
| huge pages to back the restored guest. There is no option to switch between | ||
| regular 4K pages and huge pages at restore time. Furthermore, snapshots of | ||
| microVMs backed with huge pages can only be restored via UFFD. Lastly, note that | ||
| even for guests backed by huge pages, differential snapshots will always track | ||
| write accesses to guest memory at 4K granularity. | ||
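
To make the granularity point concrete: a 2M huge page spans 512 pages of 4K, so a single small write marks 4K as dirty rather than the whole 2M. The sketch below models only this arithmetic. It is not how Firecracker implements dirty tracking, and the function name and write offsets are invented for the example.

```python
PAGE_4K = 4 * 1024
PAGE_2M = 2 * 1024 * 1024


def dirty_bytes(write_offsets, granularity: int) -> int:
    """Bytes marked dirty if writes are tracked at the given granularity."""
    dirty_pages = {offset // granularity for offset in write_offsets}
    return len(dirty_pages) * granularity


# Four one-byte writes: three land in the first huge page, one in the second.
writes = [0, 100, 5000, PAGE_2M + 1]

print(PAGE_2M // PAGE_4K)            # 512 pages of 4K per huge page
print(dirty_bytes(writes, PAGE_4K))  # 12288 (three 4K pages)
print(dirty_bytes(writes, PAGE_2M))  # 4194304 (two whole 2M pages)
```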
|
|
||
| ## Known Limitations | ||
|
|
||
| Currently, hugetlbfs support is mutually exclusive with the following | ||
| Firecracker features: | ||
|
|
||
| - Memory Ballooning via the [Balloon Device](./ballooning.md) | ||
| - Initrd | ||
|
|
||
| ## FAQ | ||
|
|
||
| ### Why does Firecracker not offer a transparent huge pages (THP) setting? | ||
|
|
||
| Firecracker's guest memory is memfd based. Linux (as of 6.1) does not offer a | ||
| way to dynamically enable THP for such memory regions. Additionally, UFFD does | ||
| not integrate with THP (no transparent huge pages will be allocated during | ||
| userfaulting). Please refer to the [Linux Documentation][thp_docs] for more | ||
| information. | ||
|
|
||
| [hugetlbfs_docs]: https://docs.kernel.org/admin-guide/mm/hugetlbpage.html | ||
| [thp_docs]: https://www.kernel.org/doc/html/next/admin-guide/mm/transhuge.html#hugepages-in-tmpfs-shmem | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,44 @@ | ||
| // Copyright 2024 Amazon.com, Inc. or its affiliates. All Rights Reserved. | ||
| // SPDX-License-Identifier: Apache-2.0 | ||
|
|
||
| // Helper program for triggering fast page faults after UFFD snapshot restore. | ||
| // Allocates a 128M memory area using mmap, touches every page in it using memset and then | ||
| // calls `sigwait` to wait for a SIGUSR1 signal. Upon receiving this signal, it | ||
| // overwrites the entire memory area (with the value 2) to trigger fast page faults. | ||
| // The idea is that an integration test takes a snapshot while the process is | ||
| // waiting for the SIGUSR1 signal, and then sends the signal after restoring. | ||
| // This way, the `memset` will trigger a fast page fault for every page in | ||
| // the memory region. | ||
|
|
||
| #include <stdio.h> // perror | ||
| #include <signal.h> // sigwait and friends | ||
| #include <string.h> // memset | ||
| #include <sys/mman.h> // mmap | ||
|
|
||
| #define MEM_SIZE_MIB (128 * 1024 * 1024) | ||
|
|
||
| int main(int argc, char *const argv[]) { | ||
| sigset_t set; | ||
| int signal; | ||
|
|
||
| sigemptyset(&set); | ||
| if(sigaddset(&set, SIGUSR1) == -1) { | ||
|
||
| perror("sigaddset"); | ||
| return -1; | ||
| } | ||
|
|
||
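| void *ptr = mmap(NULL, MEM_SIZE_MIB, PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0); | ||
|
|
||
| // Check for failure before touching the mapping. | ||
| if(MAP_FAILED == ptr) { | ||
| perror("mmap"); | ||
| return -1; | ||
| } | ||
|
|
||
| // Touch every page so that it is resident before the snapshot is taken. | ||
| memset(ptr, 1, MEM_SIZE_MIB); | ||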
|
|
||
| sigwait(&set, &signal); | ||
|
|
||
| memset(ptr, 2, MEM_SIZE_MIB); | ||
|
|
||
| return 0; | ||
| } | ||
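
The touch, wait, touch-again sequence of the helper above can be mirrored in a few lines of Python, which may be handy for experimenting with the pattern outside a guest. This is a sketch under stated assumptions, not the test code from this pull request: it uses a much smaller mapping, writes one byte per page instead of calling `memset`, and queues `SIGUSR1` for itself with `pthread_kill` as a stand-in for the integration test sending the signal after restore. The function name `touch_wait_touch` is hypothetical.

```python
import mmap
import signal
import threading

MEM_SIZE = 4 * 1024 * 1024  # far smaller than the helper's 128M, to keep the demo quick


def touch_wait_touch(size: int = MEM_SIZE):
    """Touch every page, wait for SIGUSR1, then touch every page again."""
    # Block SIGUSR1 so that it stays pending until sigwait() collects it.
    old_mask = signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGUSR1})
    try:
        mem = mmap.mmap(-1, size, flags=mmap.MAP_PRIVATE)  # anonymous private mapping
        for offset in range(0, size, mmap.PAGESIZE):
            mem[offset] = 1  # first touch: make each page resident
        # Stand-in for the integration test: queue the signal for this thread.
        signal.pthread_kill(threading.get_ident(), signal.SIGUSR1)
        received = signal.sigwait({signal.SIGUSR1})
        for offset in range(0, size, mmap.PAGESIZE):
            mem[offset] = 2  # second touch: the writes of interest
        first_byte = mem[0]
        mem.close()
    finally:
        signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)
    return received, first_byte


sig, value = touch_wait_touch()
print(sig == signal.SIGUSR1, value)
```

In the scenario the helper is written for, the snapshot would be taken while the process sits in `sigwait`, so the second round of writes happens in the restored guest.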
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,50 @@ | ||
| // Copyright 2024 Amazon.com, Inc. or its affiliates. All Rights Reserved. | ||
| // SPDX-License-Identifier: Apache-2.0 | ||
|
|
||
| //! Provides functionality for a userspace page fault handler | ||
| //! which loads the whole region from the backing memory file | ||
| //! when a page fault occurs. | ||
|
|
||
| mod uffd_utils; | ||
|
|
||
| use std::fs::File; | ||
| use std::os::unix::net::UnixListener; | ||
|
|
||
| use uffd_utils::{Runtime, UffdHandler}; | ||
| use utils::get_page_size; | ||
|
|
||
| fn main() { | ||
| let mut args = std::env::args(); | ||
| let uffd_sock_path = args.nth(1).expect("No socket path given"); | ||
| let mem_file_path = args.next().expect("No memory file given"); | ||
|
|
||
| let file = File::open(mem_file_path).expect("Cannot open memfile"); | ||
|
|
||
| // Get Uffd from UDS. We'll use the uffd to handle PFs for Firecracker. | ||
| let listener = UnixListener::bind(uffd_sock_path).expect("Cannot bind to socket path"); | ||
| let (stream, _) = listener.accept().expect("Cannot listen on UDS socket"); | ||
|
|
||
| // Fault in all of guest memory from the backing memory file on the first page fault. | ||
| // This is just an example, probably the worst-case latency scenario, | ||
| // of how memory can be loaded into guest RAM. | ||
| let len = get_page_size().unwrap(); // page size does not matter, we fault in everything on the first fault | ||
|
|
||
| let mut runtime = Runtime::new(stream, file); | ||
| runtime.run(len, |uffd_handler: &mut UffdHandler| { | ||
| // Read an event from the userfaultfd. | ||
| let event = uffd_handler | ||
| .read_event() | ||
| .expect("Failed to read uffd_msg") | ||
| .expect("uffd_msg not ready"); | ||
|
|
||
| match event { | ||
| userfaultfd::Event::Pagefault { .. } => { | ||
| for region in uffd_handler.mem_regions.clone() { | ||
| uffd_handler | ||
| .serve_pf(region.mapping.base_host_virt_addr as _, region.mapping.size) | ||
| } | ||
| } | ||
| _ => panic!("Unexpected event on userfaultfd"), | ||
| } | ||
| }); | ||
| } |
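
The strategy of the handler above (serve every memory region as soon as any page fault arrives) can be modelled without userfaultfd at all. The toy model below uses plain byte buffers in place of the memory file and guest RAM; the class name, the `(offset, size)` region layout, and the method names are all assumptions for illustration and do not correspond to the `uffd_utils` API.

```python
class FaultAllHandler:
    """Toy model: any page fault populates every region from the backing data."""

    def __init__(self, backing: bytes, regions):
        # regions: (offset, size) pairs valid for both `backing` and guest memory
        self.backing = backing
        self.regions = list(regions)
        self.guest = bytearray(len(backing))  # zero-filled, i.e. nothing loaded yet

    def on_page_fault(self, addr: int) -> None:
        # The faulting address is ignored, mirroring the `Pagefault { .. }` match
        # arm above: every region is copied in, whichever page faulted.
        for offset, size in self.regions:
            self.guest[offset:offset + size] = self.backing[offset:offset + size]


backing = bytes(range(256)) * 32  # 8192 bytes of recognisable data
handler = FaultAllHandler(backing, [(0, 4096), (4096, 4096)])
handler.on_page_fault(5000)       # a single fault somewhere in the second region
print(bytes(handler.guest) == backing)
```

Loading everything up front trades a long first fault for no later faults, which is why the comment in the handler describes it as a likely worst case for latency.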