Add AddressSpaceManager #214
Conversation
This change brings in a new approach to managing address space. It wraps the Pal with a power-of-two reservation system that guarantees all returned blocks are naturally aligned to their size. It either lets the Pal perform aligned requests, or over-allocates and splits the result into power-of-two blocks.
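The over-allocation trick mentioned above can be sketched as follows (helper names are invented for illustration; the real manager works through the Pal on raw reservations):

```cpp
#include <cstdint>

// Hedged sketch with invented names, not snmalloc's actual code.
// Round `p` up to the next multiple of `align` (a power of two).
static std::uintptr_t align_up(std::uintptr_t p, std::uintptr_t align)
{
  return (p + align - 1) & ~(align - 1);
}

// Over-allocate-and-split: a reservation [base, base + 2*size) that is only
// page aligned always contains a `size`-aligned block of `size` bytes.
// The manager returns that block and keeps the power-of-two pieces before
// and after it for later requests.
static std::uintptr_t naturally_aligned_block(std::uintptr_t base,
                                              std::uintptr_t size)
{
  return align_up(base, size);
}
```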
Looks like a really nice cleanup!
```
Only one of these needs to be implemented, depending on whether the underlying
system can provide strongly aligned memory regions.
If the system guarantees only page alignment, implement the second and snmalloc
will over-allocate and then trim the requested region.
If the system guarantees only page alignment, implement the second. The Pal is
```
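A hypothetical illustration of the two alternatives (signatures invented here, not snmalloc's actual Pal interface; the bodies use `aligned_alloc`/`malloc` purely as stand-ins for real reservations):

```cpp
#include <cstddef>
#include <cstdlib>

// Invented signatures for illustration; not snmalloc's real Pal API.
struct PalAligned
{
  // Implement this flavour when the system can return memory aligned to
  // `size` (a power of two) directly.
  static void* reserve_aligned(std::size_t size) noexcept
  {
    return std::aligned_alloc(size, size); // stand-in for an aligned mmap
  }
};

struct PalPageAligned
{
  // Implement this flavour when only page alignment is guaranteed; the
  // manager will over-allocate and trim.  The region is reserved but not
  // committed: the caller usually touches only a subset, so committing it
  // all up front and then decommitting the trimmed parts would be wasted
  // work.
  static void* reserve(std::size_t size) noexcept
  {
    return std::malloc(size); // stand-in for an uncommitted reservation
  }
};
```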
Please can you document that the second one does not commit and why (I presume because the caller will always use a subset of the allocated space and so it makes no sense to atomically commit it and then decommit a chunk)?
src/mem/address_space.h
Outdated
```
namespace snmalloc
{
  template<typename Pal>
  class AddressSpaceManager : Pal
```
Please can we have a doc comment explaining what this class is for?
It looks as if we have one, but inside the class? I've only ever seen doc comments there in the Windows NT codebase, I don't think any open source tooling can handle them in that location.
src/mem/address_space.h
Outdated
```
// There are a maximum of two blocks for any size/align in a range.
// One before the point of maximum alignment, and one after.
// As we can add multiple ranges the second entry's block may contain
// a pointer to subsequent blocks.
```
Please make this and the comment below into doc comments and expand them. It isn't completely clear to me what this is. I think it is an array of doubly linked lists, indexed by the log2(size) of the allocation (though it seems to be larger than it needs to be: if `bits::BITS` is 64, we have at most 51 used entries with 4KiB pages, even on systems that use the full 64-bit range).
Why are you using a size-two array instead of a pair?
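For reference, the structure being asked about looks roughly like this (a simplified sketch, not snmalloc's actual code; `BITS` stands in for `bits::BITS`):

```cpp
#include <array>
#include <cstddef>

// Simplified sketch of the structure under discussion.  `ranges` is
// indexed by log2 of the block size.  Each entry holds at most two free
// blocks of that size: one carved off before the point of maximum
// alignment of a reserved range, and one after.  When further ranges are
// added and more blocks of one size accumulate, the second slot's block
// stores a pointer to the next block inside the block itself, forming an
// intrusive list rather than a doubly linked one.
constexpr std::size_t BITS = 64; // stand-in for bits::BITS
static std::array<std::array<void*, 2>, BITS> ranges{};
```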
```
  }

  // Look for larger block and split up recursively
  void* bigger = remove_block(align_bits + 1);
```
It looks as if there's no combine operation; what is the fragmentation implication here? Should we always round up requested allocation sizes to a multiple of the superslab size?
snmalloc never returns address space, so we never have the chance to consolidate.
You should ask for the size you need; it will do its best based on the alignment.
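The split-without-combine scheme can be sketched as follows (a toy model with invented names; the real code asks the Pal for more address space when nothing larger is free):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy model of the power-of-two structure, not snmalloc's code.
// free_lists[b] holds the start addresses of free blocks of size 1 << b.
static std::vector<std::vector<std::uintptr_t>> free_lists(64);

// Take a block of size 1 << align_bits, recursively splitting a larger
// block when none of the requested size is free.  There is no combine
// operation: since snmalloc never returns address space to the system,
// blocks are only ever split, never merged back together.
static std::uintptr_t remove_block(std::size_t align_bits)
{
  if (!free_lists[align_bits].empty())
  {
    std::uintptr_t block = free_lists[align_bits].back();
    free_lists[align_bits].pop_back();
    return block;
  }
  // Look for a larger block and split up recursively.  (This toy model
  // assumes a larger block exists somewhere.)
  std::uintptr_t bigger = remove_block(align_bits + 1);
  // Keep the top half for later; the bottom half is returned and remains
  // naturally aligned to 1 << align_bits.
  free_lists[align_bits].push_back(bigger + (std::uintptr_t(1) << align_bits));
  return bigger;
}
```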
```
  // considerably, and should never be on the fast path.
  std::atomic_flag spin_lock = ATOMIC_FLAG_INIT;

  inline void check_block(void* base, size_t align_bits)
```
Please add doc comments for these methods.
I disagree with writing doc comments for most of these methods, as the name is as descriptive as the comment I would write.
I had to read the implementations of these to understand why they are used.
Code doesn't normally document "why it is used" but "what it does". Personally, I find the name as descriptive as the comment. If a method has to document its use site, then in this instance that is okay, as most are used in a single place, but normally it would not be. I could inline them, but the code becomes much less readable.
```
  {
    constexpr size_t min_size =
      bits::is64() ? bits::one_at_bit(32) : bits::one_at_bit(28);
```
What are these magic numbers doing? Ensuring that we allocate at least 256MiB chunks on 32-bit and 4GiB chunks on 64-bit platforms? Why?
These control over-allocation, and the platform decides the values. On OE, the Pal can return the whole heap; on 32-bit we over-allocate, but not by too much; on 64-bit we can go larger. The numbers might need refinement.
I see. There's a non-trivial overhead from the kernel having to maintain virtual memory metadata for these large chunks, so a comment here explaining the tradeoff is probably a good idea.
If you think other numbers would work better, I'm happy to change them.
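For context, the constants in the quoted snippet pick a minimum reservation size per platform width (this sketch mirrors the quoted code with stand-in helpers, not the real `bits` header):

```cpp
#include <cstddef>

// Mirrors the quoted snippet with stand-in helpers: the minimum size the
// manager requests from the Pal when over-allocating.  4GiB on 64-bit,
// 256MiB on 32-bit.  Bigger reservations mean fewer system calls and less
// fragmentation, but more kernel virtual-memory metadata per chunk, so the
// values are a tradeoff and, as discussed above, may need refinement.
constexpr bool is64 = sizeof(void*) == 8;
constexpr std::size_t one_at_bit(std::size_t b)
{
  return std::size_t(1) << b;
}
constexpr std::size_t min_size = is64 ? one_at_bit(32) : one_at_bit(28);
```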
```
@@ -24,87 +24,28 @@ namespace snmalloc

  // There are a maximum of two blocks for any size/align in a range.
  // One before the point of maximum alignment, and one after.
  static inline std::array<std::array<void*, 2>, bits::BITS> ranges;
  static inline void* heap_base = nullptr;
```
Missing doc comment and the comment above doesn't seem to reflect the code.
@achamayou, @anakrish this should bring the OE memory requirements down further. The minimum should now be 256KiB * (N + 1) for N threads. Obviously, if you use a lot of memory, then you will need more, but this is the minimum required to set up the global and per-thread data structures. This improves a little on #212, which gave 256KiB * (N + 2).
Addresses #213
So I ran the mimalloc benchmark, to see about perf:
The drop in RSS for this change is pretty good. I think this is interacting much better with transparent huge pages. We have a less fragmented address space, so it can consolidate better. Overall, I would say performance is not really changed, possibly slightly improved, but would need a lot more runs to confirm. Here is a second run, where I added the Open Enclave slab sizes into the mix.
The new configuration is generally performing well, except for Larson. The two tables are both just single runs, so pretty noisy.
I disabled transparent huge pages and there is no change above noise:
Nice! It looks as if the 1MiB slab size has a noticeable impact on RSS but a small (and sometimes positive) impact on performance. I wonder if we should consider adjusting the default. The smallest superpage size on x86 is 2MiB and 1MiB on ARM; perhaps we should consider adding a
@davidchisnall I agree we should look at the defaults. I think assessing that needs some time invested in statistically significant benchmarking of parameter sweeps, which I don't have time to set up currently. I.e. it is not part of the PR ;-)