Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

replace "heap" with "dynamic allocation" in the documentation #18226

Closed
thestinger opened this issue Oct 22, 2014 · 34 comments
Closed

replace "heap" with "dynamic allocation" in the documentation #18226

thestinger opened this issue Oct 22, 2014 · 34 comments

Comments

@thestinger
Copy link
Contributor

The documentation should be conveying the language semantics, not an inaccurate analogy of the implementation details. Rust never makes use of the dss section (aka the heap) by default as jemalloc prefers obtaining all memory via anonymous memory mappings. Stacks are dynamic allocations coming from the same operating system API as Box<T> and Rc<T>. It's all obtained via mmap and VirtualAlloc. The classical distinction between a static call stack and a dynamic heap doesn't really exist on modern operating systems since multi-threading means stacks are dynamic memory allocations.

@zwarich
Copy link

zwarich commented Oct 22, 2014

To be fair, the term heap is commonly used without confusion on platforms like OS X where the standard allocator uses the equivalent of anonymous mappings.

@bstrie
Copy link
Contributor

bstrie commented Oct 22, 2014

I tend to agree, to me the idea of a "heap" conjures images of Java.

On the other hand, the "heap vs. stack" dichotomy, imagined or not, is quite popular. I'm leery of giving people the impression "lol rust allocates its stack on the HEAP? nah I'll go back to a real lang like c++ thx".

@thestinger
Copy link
Contributor Author

On the other hand, the "heap vs. stack" dichotomy, imagined or not, is quite popular. I'm leery of giving people the impression "lol rust allocates its stack on the HEAP? nah I'll go back to a real lang like c++ thx".

You've put forward an argument against a different proposal than the one here. The term heap should be completely removed from the documentation because it distracts from the high level semantics and doesn't actually have a basis in reality. Rust doesn't use the data section referred to as the heap (dss) for either stacks or dynamic memory allocations. The dss section only exists for legacy reasons anyway.

@thestinger
Copy link
Contributor Author

C and C++ don't mention the terms stack or heap at all. The language and library APIs are defined in terms of the high level semantics and the low-level details are left to the implementation. Those low-level details do not really include a distinction between dynamic allocations of stacks for new threads and other dynamic allocations.

@arielb1
Copy link
Contributor

arielb1 commented Oct 22, 2014

While I agree that a C-style language specification, if there was such a document, shouldn't talk about "the stack" and "the heap", the former is a significant component of almost all ABI-s, being the scratch space given to functions, and "heap" is just the short term for "memory managed by the process-global dynamic allocator" (I haven't seen anyone really claim that "the heap" means "the dss section"), so these are terms low-level developers are familiar with, and documentation aimed at these developers should talk about how Rust uses them.

@thestinger
Copy link
Contributor Author

@arielb1: Stacks are dynamic allocations and come from the same OS API as other memory allocations. The usage of an allocator in between these allocations and the OS API is an implementation detail that's likely going to change soon. It doesn't make sense for Rust to call mmap and munmap directly or allow the POSIX thread API to call it because it increases memory fragmentation and decreases performance. It really only needs to deal directly with mprotect.

@thestinger
Copy link
Contributor Author

are terms low-level developers are familiar with, and documentation aimed at these developers should talk about how Rust uses them

General purpose documentation should concentrate on semantics, not implementation details. Documentation of the implementation details or performance characteristics has no use case for familiar but inaccurate ways of describing the functionality.

@thestinger
Copy link
Contributor Author

I haven't seen anyone really claim that "the heap" means "the dss section"

The term isn't used by C and C++ themselves because it's a platform-specific implementation detail. That's how the heap is defined by the Linux kernel and the userspace documentation. The way Rust's documentation uses the term is inaccurate, misleading and confusing. If you want to talk about implementation details, then lying about how it actually works isn't a great start.

For example, the glibc malloc implementation has a MMAP_THRESHOLD variable which it describes as the threshold where it switches from the heap (dss) to anonymous mappings. Rust uses anonymous mappings for all memory allocations.

@arielb1
Copy link
Contributor

arielb1 commented Oct 22, 2014

@thestinger

Of course, both the stack, the heap, static allocation, and manually mmaped zones are all regions of virtual memory that are used in the same way.

However, from within a function, memory from these areas certainly isn't allocated from the same API-s – stack allocations are a pointer bump, heap allocations are function calls, arena allocations via different function calls (which may eventually allocate from other regions) etc. (inside a function, there is also memory accessible via a reference, which can come from anywhere without the function needing to know about it).

Also, the MMAP_THRESHOLD documentation (at man 3 mallopt) does not really talk about heap=dss≠mmap (What it does talk about is the fact that over-threshold allocations aren't on the free-list, but that's a distinct concern, and also with glibc the free-list manages the dss segment, and mallopt is where implementation details are exposed).

@thestinger
Copy link
Contributor Author

@arielb1: The documentation is quite clear that sbrk manages the heap, and it's explicitly referred to as [heap] in /proc/self/maps and elsewhere.

Normally, malloc() allocates memory from the heap, and adjusts the size of the heap as required, using sbrk(2). When allocating blocks of memory larger than MMAP_THRESHOLD bytes, the glibc malloc() implementation allocates the memory as a private anonymous mapping using mmap(2).

@thestinger
Copy link
Contributor Author

However, from within a function, memory from these areas certainly isn't allocated from the same API-s – stack allocations are a pointer bump, heap allocations are function calls, arena allocations via different function calls (which may eventually allocate from other regions) etc. (inside a function, there is also memory accessible via a reference, which can come from anywhere without the function needing to know about it).

You're conflating the usage pattern of the memory after it's allocated with obtaining the memory allocation. Putting local variables on the stack or pushing elements to a vector with reserved capacity is not a memory allocation. The stack itself is a dynamic memory allocation and the call stack manages that memory as a stack data structure.

@thestinger
Copy link
Contributor Author

Also, the MMAP_THRESHOLD documentation (at man 3 mallopt) does not really talk about heap=dss≠mmap (What it does talk about is the fact that over-threshold allocations aren't on the free-list, but that's a distinct concern, and also with glibc the free-list manages the dss segment, and mallopt is where implementation details are exposed).

The documentation is incredibly clear that there's a distinction between the heap (managed by sbrk) and anonymous memory mappings from mmap. It's difficult for me to understand where you're coming from when the language in the documentation is so clear about the distinction.

Allocating memory using mmap(2) has the significant advantage that the allocated memory blocks can always be independently released back to the system. (By contrast, the heap can be trimmed only if memory is freed at the top end.)

Another case where the term "heap" is mentioned:

When the amount of contiguous free memory at the top of the heap grows sufficiently large, free(3) employs sbrk(2) to release this memory back to the system.

Or just look in /proc/self/maps along with the Linux kernel API documentation / code and you'll see that mmap mappings != heap.

@thestinger
Copy link
Contributor Author

What it does talk about is the fact that over-threshold allocations aren't on the free-list, but that's a distinct concern, and also with glibc the free-list manages the dss segment

This isn't true. It uses free lists for more than just managing memory in the heap (dss) because it will make secondary arenas with mmap and they aren't subject to trimming.

@arielb1
Copy link
Contributor

arielb1 commented Oct 22, 2014

I got this distinction from musl libc, which manages the malloc heap entirely with mmap, and also allocates large blocks in separately mmap-ed areas (for some not-particularly-clear reason, it adds a header to these large blocks, ensuring misalignment and wasted pages – but glibc also seems to do it), which are never stored on a free list, but rather free-d immediately.

@thestinger
Copy link
Contributor Author

On Linux, heap refers to the data section managed by sbrk. It's explicit in the documentation and the kernel APIs like /proc/self/maps. If musl doesn't use sbrk then it's not making use of what Linux refers to as the heap. It's not subjective, it's encoded in the kernel ABI.

@arielb1
Copy link
Contributor

arielb1 commented Oct 22, 2014

And if you set %esp to a newly-mmap'd area then %esp doesn't point to the stack? That's not the definition of "the stack" that I'm used to (which is "the region %esp is supposed to point to"), but we're arguing about definitions.

@thestinger
Copy link
Contributor Author

No, we're not arguing about definitions. I'm talking about dynamic memory allocations and you're talking about how the memory is used after it's allocated. It's possible to use a Vec<T> as a call stack by pointing the stack pointer into it. In fact, that's how Rust dealt with segmented stacks in the past. Either way, both are dynamic allocations from the same source. The distinction drawn between them in the documentation is incorrect.

@thestinger
Copy link
Contributor Author

The call stack is a very simple data structure implemented on top of a dynamic memory allocation. It doesn't appear magically without calling into the general purpose allocator or going directly through the operating system API (which I expect is slower than calling mprotect after allocation and before deallocating).

The general purpose documentation should not be getting into these implementation details, and I am going to enforce accuracy in any implementation documentation. That leaves no place for an incorrect distinction between stack and "heap" allocations, because it's not how things are done. It will be even more blatantly incorrect when there's support for allocators.

@wycats
Copy link
Contributor

wycats commented Oct 22, 2014

@thestinger I think you're right about "heap" being not ideal, but I don't think "dynamic allocation" is much better, except perhaps in a narrow formal terminology sense.

@thestinger
Copy link
Contributor Author

All I know is that there used to be accurate high-level documentation on boxes in the tutorial. The new documentation gets bogged down in the implementation details rather than sticking to the language semantics and it isn't correct about those implementation details. I don't think it should get into this stuff at all because it changes from week to week but if it is going to then it needs to be accurate.

@wycats
Copy link
Contributor

wycats commented Oct 22, 2014

@thestinger I like the idea of doubling down on "box" as terminology, especially since it's what the syntax says: box foo not heapify foo.

@thestinger
Copy link
Contributor Author

The box keyword is supposedly a generic placement new feature anyway so it will be used for inserting into collections, making other smart pointer types and those containers / smart pointers may be using custom allocators like an arena on the call stack.

@nikomatsakis
Copy link
Contributor

@thestinger I confess I am somewhat confused. To me, the terms "heap" and "stack" have basically nothing to do with the source of the memory, and everything to do with the usage pattern of the memory. That is, the call stack refers to the big hunk of memory we reserve to store function activations. This is used according to a stack discipline (first in, last out).

The heap, in contrast, refers to memory that is allocated and freed in some other pattern that does not follow a stack discipline. That is, it is not first in, last out. Naturally this requires a more complicated data structure to track what memory is free than a stack does. I don't believe that, in common usage, the word heap is specifically tied to the DSS segment (or any other).

Heap vs stack seems like a useful distinction and one that is very meaningful to Rust. In particular, the stack is tied to the language in the form of lifetimes and so on.

When you say "in the documentation", I guess you mean things like the guide and so forth? I can see the argument for being abstract in a language reference, but I think we are permitted some license in the tutorial. Being overly generic and abstract can make text quite hard to understand. All that said, I could believe that we can use the word "box" rather than "heap" in many places.

@thestinger
Copy link
Contributor Author

It can refer to the call stack and dynamic allocation, but referring to a heap is neither clear or accurate. Pushing a stack frame onto the call stack isn't memory allocation, it's usage of a fixed amount of pre-allocated memory. It's comparable to pushing and popping from a vector. The memory allocated for the call stack is not any easier to track than memory allocated for a vector. In fact, I expect that both will end up using the same allocator code path in the future.

Heap vs stack seems like a useful distinction and one that is very meaningful to Rust. In particular, the stack is tied to the language in the form of lifetimes and so on.

Lifetimes are tied to the call stack, but that call stack is a high-level language concept managing memory spread out across registers, the current thread's call stack and other dynamically allocated memory like Box<T> and Rc<T>.

@comex
Copy link
Contributor

comex commented Oct 24, 2014

I haven't lived long enough to have off-hand references to historical usage, but "heap" is certainly frequently used to refer to any pool of memory from which dynamic allocations are carved out. Using it to refer to some Linux section is a nonstandard and specialized meaning which can be useful, but has little chance of confusing anyone using Rust, especially on any other platform. To wit: OS X frequently uses the term "heap" despite not having such a section. So does Windows. jemalloc and tcmalloc refer to profiling the area of memory they manage as "heap profiling". The term for overflowing a malloced buffer is heap overflow. Wikipedia's memory management article refers to any pool of memory used for allocation as "the heap". Even the Linux kernel has a bunch of references to heap allocation from other heaps.

FWIW, I have literally never heard of .dss, nor is what sbrk manages usually described as "the heap", see one, two, three.

@thestinger
Copy link
Contributor Author

@comex: The areas managed by jemalloc will include our call stacks in the future. The glibc malloc documentation explicitly refers to the sbrk heap as the heap as does the kernel ABI (as [heap] - a special mapping name like [vdso]).

@comex
Copy link
Contributor

comex commented Oct 24, 2014

By doing that, you have allocated your stack from the heap (you requested a portion of the jemalloc heap, and now you are using it as a stack) instead of directly from the kernel. There is no conflict here, any more than there is if you allocate an arena from the global heap and then start allocating sub-objects out of that. What is important is what data structure you directly obtain a particular piece of memory from, as this determines its lifetime semantics. (You could just focus on the lifetime semantics, but anything that dynamically manages memory is a "heap" and stacks are stacks, so you may as well just say the words.)

@thestinger
Copy link
Contributor Author

@comex: I can accept heap as a near synonym for address space but I don't think it's a sensible way to refer to anything not on a call stack. I don't see why an 2MiB vector used as a LIFO data structure would qualify as a heap allocation but the same 2MiB allocation used as a call stack would not.

@arielb1
Copy link
Contributor

arielb1 commented Oct 25, 2014

The allocation of the vector is an heap allocation (i.e. call to malloc or je_malloc or whatever), and the same goes for the allocation of the stack (inside task::spawn), uses of it aren't. In the future, we could possibly have box place an element in a vector which is in an arena which is on the stack which is in an mmaped region that is part of the heap. That call to box won't allocate anything on either the stack or heap.

In that context, I could store an element in the vector, in a new place in the arena, in a new local variable (stored on the stack), in a new je_malloc allocation, or even in a freshly-mmap-ed region.

Now, vectors are typically used for more than just place to store things in (otherwise they would be called "arenas"), so putting things into vectors is typically not called "allocation". When programs want somewhere to put things into, the "classical" places are either a stack allocation or heap allocation.

@jpetkau
Copy link

jpetkau commented Nov 5, 2014

The use of "heap" to mean "unordered pool of free memory" predates the existence of such things as jemalloc and dss sections by decades. The documentation is conveying the semantics accurately; it's just using a broader and much more established definition of the word "heap".

The fact the word is sometimes used in a narrower sense in other contexts doesn't make this use inaccurate.

@thestinger
Copy link
Contributor Author

@jpetkau: Stack allocations come from the same pool of memory. It also completely ignores the possibility of having custom allocators, which is intended.

@thestinger
Copy link
Contributor Author

Virtual memory / anonymous memory mappings are what makes the term so inaccurate now.

@tomjakubowski
Copy link
Contributor

This was brought up in a discuss thread as well: http://internals.rust-lang.org/t/newcomer-to-rust-my-experience/1816/28

On the topic of borrowing and lifetimes, a lot of people here have been talking about "systems programming". I think this is a red herring. You can perfectly well explain a lot of the core concepts WITHOUT reference to the heap or stack. Just explain it from the point of view of when resources are released or copied. People who know how this works will read between the lines. People who don't, don't need to be distracted!

@steveklabnik
Copy link
Member

With the stack and the heap chapter, exactly what we mean in these contexts has been addressed.

lnicola pushed a commit to lnicola/rust that referenced this issue Oct 17, 2024
Correctly parse `use` in generic parameters

Fixes: rust-lang#18225
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

10 participants