Memory management in the C runtime #805
-
These are very interesting ideas... We really need to pursue this...
-
It is an interesting observation that memory management in the runtime can influence the performance of the reaction code. My guess is that this is due to memory fragmentation. Most often, malloc finds a suitable memory area very quickly, but if memory becomes fragmented due to frequent malloc and free, it can become slower. Given that the C target is rather static (all parameters are known at compile time), wouldn't we be able to compute the memory size required by the runtime in advance? That said, I would be hesitant to implement static allocation of a fixed memory area for the reasons you outline above.

I think, however, that it could be a good idea to separate runtime memory from user memory. Normally, this is done with memory pools and a custom allocator. The custom allocator would allocate a larger piece of memory, let's say 1 MiB, and then serve any memory allocations directly from the pool. Should there be no more space in the pool, it would allocate another 1 MiB chunk to increase the pool size. Obviously there is overhead in this solution, since the memory pool needs to be managed. This overhead can be reduced if the pool is intended to only allocate objects of a fixed size, like a node in the priority queue. I implemented a memory pool for fixed-size objects in the C++ runtime in the hope of speeding up some dynamic data structures, but I found that it performed worse than a plain …
-
In the real-time meeting today @erlingrj had an interesting comment about bounding the size of the event queue that I did not think of in my original post. I think the idea is that if we do not allow multiple events to be scheduled in one reaction invocation, then the number of events on the event queue could be bounded? One problem that might need to be solved is that one event could result in multiple different reactions being triggered, each of which might schedule an event. So I think that the event queue should steadily grow in a program like this:
Unfortunately, I was not able to verify that an out-of-memory error occurs because the memory usage grows so slowly (like, only 80 megabytes after one hour?). #1464 might be tangentially related to this since it pertains to the dropping of events.
-
Counterexample:
Event queue grows without bound.
-
As briefly proposed today at our meeting, I propose we try a bottom-up approach. Let us build 2-3 simple examples that we can statically analyze and that can result in a possible static schedule. For this I would start by using features that are all visible at the LF level: e.g., explicit timers, and no scheduling of events in the future, only at the same time instant.
-
I did not mean timers only. I think it is fine for a reaction triggered by a timer event to trigger an output reaction at the same time instant; that is just a dependency between reactions/reactors that can be statically scheduled. The event queue is not necessarily bounded by the number of timers, since they can have different periods. We need to look at the hyperperiod of the timers.
-
The evidence that motivates this is limited, but since it might be relevant to various aspects of the C target, it seemed okay to start this discussion sooner rather than later.
Problems
Initial Motivation
While optimizing the C runtime, I was surprised to find that I had introduced performance problems for benchmarks in which more than 99% of the time is spent executing user-supplied C code (a reaction body or function in a preamble).
In the case of SortedLinkList, 99.7% of cycles were spent traversing the list, and yet bad memory management in a version of the runtime (my mistake; recently fixed by Soroush) caused execution time to increase by 20-60%, depending on the hardware. In the case of NQueens, 99.5% of cycles were spent accessing arrays that were dynamically allocated in a preamble function, but changes to the runtime made it about 20% slower on one machine. The extreme hardware dependence here might be a problem.
I am unsure if I understand these performance issues correctly. Additionally, the first one was due to a mistake on my part that needed to be fixed anyway. However, the instruction counts, L1 cache miss counts, and branch misprediction counts that I got from perf did not seem to explain the issues, nor did the last-level cache miss counts that were estimated by a cache simulator (cachegrind). Furthermore, both cases seem to show complex interactions between the runtime and apparently unrelated C functions written by the user. This is why my tentative guess is that a) these problems arise from heap fragmentation, and b) they could potentially be a design consideration that goes beyond one or two isolated programming errors. (I would welcome and appreciate other interpretations of the data, of course, and can share more details if anyone thinks it relevant.)
Concrete Proposal
This might not be quite the right idea, but hopefully it conveys my point:
It would be nice if we could set aside a known, fixed quantity of memory for the runtime, and then give the user a pristine heap (or a pristine area of the heap) that is "like new" -- unused and unfragmented by the runtime. The user can then write reactions and functions that use malloc, calloc, realloc, etc., with as clear an understanding of the state of the heap as if they were writing pure C.
It could work like this: … malloc and friends. lfc could print a message summarizing what the runtime is using space for. … malloc and does not directly interact with the OS.

Challenges: … (assert).

Potential Advantages

Potential Disadvantages
Possibly related: #793. Additionally, @lhstrh pointed out that some of the EECS 149 students who ran into memory limitations may have something to say about this topic.