
Reduce memory usage for instruction block #232

Merged
merged 1 commit into from
Oct 2, 2023

Conversation

RinHizakura
Collaborator

The original allocation strategy for instruction blocks wastes memory. For every block, we always allocate space that can hold (1 << 10) rv_insn_t, but in most cases a block contains far fewer instructions than that.
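For context, the old scheme looks roughly like the following sketch (field and constant names are illustrative, not copied from the actual source):

```c
#include <stdint.h>
#include <stdlib.h>

#define MAX_INSN_PER_BLOCK (1 << 10) /* 1024 slots, mostly unused */

typedef struct rv_insn {
    uint32_t opcode; /* ... other decoded fields ... */
} rv_insn_t;

typedef struct block {
    uint32_t n_insn; /* instructions actually decoded */
    rv_insn_t *ir;   /* always MAX_INSN_PER_BLOCK entries */
} block_t;

static block_t *block_alloc_old(void)
{
    block_t *blk = malloc(sizeof(block_t));
    if (!blk)
        return NULL;
    blk->n_insn = 0;
    /* Every block pays for 1024 rv_insn_t up front, even if it
     * ends up decoding only a handful of instructions. */
    blk->ir = calloc(MAX_INSN_PER_BLOCK, sizeof(rv_insn_t));
    if (!blk->ir) {
        free(blk);
        return NULL;
    }
    return blk;
}
```

With `sizeof(rv_insn_t)` on the order of tens of bytes, each block reserves tens of kilobytes regardless of how short it is, which is where the ~20 MB total comes from.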

We can measure the heap usage with Valgrind: with the old design, a run of puzzle.elf allocates 20,306,989 bytes in total.

$ valgrind ./build/rv32emu ./build/puzzle.elf
==62803== Memcheck, a memory error detector
==62803== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==62803== Using Valgrind-3.18.1 and LibVEX; rerun with -h for copyright info
==62803== Command: ./build/rv32emu ./build/puzzle.elf
==62803== 
==62803== Warning: set address range perms: large range [0x59c87000, 0x159c87000) (defined)
success in 2005 trials
inferior exit code 0
==62803== Warning: set address range perms: large range [0x59c87000, 0x159c87000) (noaccess)
==62803== 
==62803== HEAP SUMMARY:
==62803==     in use at exit: 39,124 bytes in 259 blocks
==62803==   total heap usage: 1,281 allocs, 1,022 frees, 20,306,989 bytes allocated
==62803== 
==62803== LEAK SUMMARY:
==62803==    definitely lost: 0 bytes in 0 blocks
==62803==    indirectly lost: 0 bytes in 0 blocks
==62803==      possibly lost: 0 bytes in 0 blocks
==62803==    still reachable: 37,108 bytes in 238 blocks
==62803==         suppressed: 0 bytes in 0 blocks
==62803== Rerun with --leak-check=full to see details of leaked memory
==62803== 
==62803== For lists of detected and suppressed errors, rerun with: -s
==62803== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

To address the issue, we can maintain a pool of rv_insn_t and take only the required number of entries from it. This ensures a heap allocation happens only when the pool runs out of rv_insn_t. The BLOCK_POOL_SIZE parameter gives us the flexibility to balance the number of calloc calls against memory usage.
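The pool idea can be sketched as a simple bump allocator (a simplification under assumed names; the actual PR additionally links the instructions of a block into a list, which this sketch omits):

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define BLOCK_POOL_SIZE 4096 /* tunable: fewer calloc calls vs. slack */

typedef struct rv_insn {
    uint32_t opcode; /* ... other decoded fields ... */
} rv_insn_t;

/* Hand out rv_insn_t slots from the current chunk; calloc a fresh
 * chunk only when the remaining space cannot satisfy the request. */
typedef struct {
    rv_insn_t *chunk;
    size_t used;
} insn_pool_t;

static rv_insn_t *pool_take(insn_pool_t *pool, size_t n)
{
    if (!pool->chunk || pool->used + n > BLOCK_POOL_SIZE) {
        pool->chunk = calloc(BLOCK_POOL_SIZE, sizeof(rv_insn_t));
        if (!pool->chunk)
            return NULL;
        pool->used = 0;
    }
    rv_insn_t *ir = pool->chunk + pool->used;
    pool->used += n;
    return ir;
}
```

A larger BLOCK_POOL_SIZE means fewer calloc calls but more unused tail space in the last chunk; a smaller value inverts that trade-off.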

This greatly improves heap memory allocation. As the following result shows, only 313,461 bytes are allocated for the puzzle.elf example.

$ valgrind ./build/rv32emu ./build/puzzle.elf
==62927== Memcheck, a memory error detector
==62927== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==62927== Using Valgrind-3.18.1 and LibVEX; rerun with -h for copyright info
==62927== Command: ./build/rv32emu ./build/puzzle.elf
==62927== 
==62927== Warning: set address range perms: large range [0x59c87000, 0x159c87000) (defined)
success in 2005 trials
inferior exit code 0
==62927== Warning: set address range perms: large range [0x59c87000, 0x159c87000) (noaccess)
==62927== 
==62927== HEAP SUMMARY:
==62927==     in use at exit: 39,124 bytes in 259 blocks
==62927==   total heap usage: 940 allocs, 681 frees, 313,461 bytes allocated
==62927== 
==62927== LEAK SUMMARY:
==62927==    definitely lost: 0 bytes in 0 blocks
==62927==    indirectly lost: 0 bytes in 0 blocks
==62927==      possibly lost: 0 bytes in 0 blocks
==62927==    still reachable: 37,108 bytes in 238 blocks
==62927==         suppressed: 0 bytes in 0 blocks
==62927== Rerun with --leak-check=full to see details of leaked memory
==62927== 
==62927== For lists of detected and suppressed errors, rerun with: -s
==62927== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

Because two consecutive instructions may now live in discontinuous memory, a drawback of this design is the cost of random access to an instruction. Since random access seems to be needed only for some fuse operations, I think this is not a big problem: we can still reach the instruction with a linear search via the GET_NEXT_N_INSN macro, albeit less efficiently. The discontinuous memory could also introduce cache locality issues, but we may be able to tune BLOCK_POOL_SIZE to trade this off.
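The linear search amounts to walking the per-block linked list of instructions n steps, roughly like this (a hypothetical re-creation of the GET_NEXT_N_INSN idea, not the actual macro):

```c
#include <stddef.h>

typedef struct rv_insn {
    struct rv_insn *next; /* successor within the same block */
    /* ... decoded fields ... */
} rv_insn_t;

/* O(n) walk replaces the O(1) pointer arithmetic that a contiguous
 * array allowed; acceptable because only some fuse operations need
 * random access to the n-th instruction. */
static rv_insn_t *get_next_n_insn(rv_insn_t *ir, size_t n)
{
    while (n-- && ir)
        ir = ir->next;
    return ir;
}
```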

Comment on lines +400 to +402
/* FIXME: This will simply find the n-th instruction by iterating
 * the linked list linearly; we may want to find a better approach. */

Check notice

Code scanning / CodeQL

FIXME comment

FIXME comment: This will simply find the n-th instruction by iterating [...]
@RinHizakura
Collaborator Author

This changes how the instructions are accessed, so careful testing to make sure everything still works correctly is important. But as #228 reports, we currently have some issues with the compliance test.

@jserv jserv requested a review from qwe661234 September 26, 2023 04:52
@qwe661234
Collaborator

I have implemented this without modifying a large amount of code in this branch. After merging branch wip/jit back to master, the interpreter's memory usage will be reduced.

@jserv
Contributor

jserv commented Sep 26, 2023

I have implemented this without modifying a large amount of code in this branch. After merging branch wip/jit back to master, the interpreter's memory usage will be reduced.

We should prioritize the implementation of the interpreter as the foundation for further optimizations, such as a memory pool. The wip/jit branch is for development purposes, and we should merge our efforts back into the master branch when possible.

@RinHizakura
Collaborator Author

RinHizakura commented Sep 26, 2023

I have implemented this without modifying a large amount of code in this branch. After merging branch wip/jit back to master, the interpreter's memory usage will be reduced.

A solution with fewer modifications would be preferable, but I doubt whether mpool can address this issue properly. As far as I know, mpool is useful for chunks of memory that are allocated and freed frequently. However, the rv_insn_t in each block persist after allocation until the emulator is closed. Also, it looks like mpool still allocates a fixed large chunk for each block, which still leaves unused memory in each block.

On the other hand, this PR tries to give each block "just enough" space. Although we can still over-allocate, by growing memory according to the number of rv_insn_t needed rather than per block_t, I believe this solution can save more space.

Of course, I may have to do more experiments to make sure I am correct.

@RinHizakura RinHizakura force-pushed the master branch 3 times, most recently from fe5c2a1 to babc35f Compare September 30, 2023 05:49
Contributor

@jserv jserv left a comment


Rebase onto the latest master branch and utilize the FORCE_INLINE macro.

@jserv jserv merged commit 0031224 into sysprog21:master Oct 2, 2023
vestata pushed a commit to vestata/rv32emu that referenced this pull request Jan 24, 2025
The original memory allocation strategy for instruction blocks was found
to be inefficient, leading to excessive memory usage. In the previous
approach, a fixed amount of memory was allocated for each block, resulting
in significant wastage.

To address this issue, we have implemented a more efficient memory
allocation scheme. Instead of allocating a fixed size for each block, we
now maintain a pool of rv_insn_t and allocate memory only when needed.
This new approach minimizes heap allocations and optimizes memory usage.

We have introduced a parameter, BLOCK_POOL_SIZE, which allows us to control
the balance between the number of calloc calls and memory consumption. This
flexibility ensures that memory allocation occurs only when the pool is
depleted.

As a result of these changes, the heap memory allocation has significantly
improved. For example, in the puzzle.elf example, we observed a reduction in
heap memory allocation from 20,306,989 bytes to just 313,461 bytes.

While this design may leave instructions that are adjacent in sequence in
discontinuous memory, the impact on random access is minimal, as random
access is primarily required for certain fuse operations. Where it is
needed, we can fall back to a linear search. The potential cache locality
issues resulting from the discontinuous memory can also be mitigated by
adjusting the BLOCK_POOL_SIZE parameter for better performance.