
Reduce memory usage for instruction block #232

Merged
merged 1 commit into from
Oct 2, 2023

Conversation

RinHizakura
Collaborator

The original allocation strategy for instruction blocks wastes memory. For every block, we always allocate space that can hold (1 << 10) rv_insn_t, but in most cases a block contains far fewer instructions than that.
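For context, the old scheme looks roughly like the following sketch (field and constant names are illustrative, not copied from the actual source):

```c
#include <stdint.h>
#include <stdlib.h>

#define MAX_INSN_PER_BLOCK (1 << 10) /* 1024 slots, mostly unused */

typedef struct rv_insn {
    uint32_t opcode; /* ... other decoded fields ... */
} rv_insn_t;

typedef struct block {
    uint32_t n_insn; /* instructions actually decoded */
    rv_insn_t *ir;   /* always MAX_INSN_PER_BLOCK entries */
} block_t;

static block_t *block_alloc_old(void)
{
    block_t *blk = malloc(sizeof(block_t));
    if (!blk)
        return NULL;
    blk->n_insn = 0;
    /* Every block pays for 1024 rv_insn_t up front, even if it
     * ends up decoding only a handful of instructions. */
    blk->ir = calloc(MAX_INSN_PER_BLOCK, sizeof(rv_insn_t));
    if (!blk->ir) {
        free(blk);
        return NULL;
    }
    return blk;
}
```

With `sizeof(rv_insn_t)` on the order of tens of bytes, each block reserves tens of kilobytes regardless of how short it is, which is where the ~20 MB total comes from.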

We can measure the heap usage with Valgrind: with the old design, a run of puzzle.elf allocates 20,306,989 bytes in total.

$ valgrind ./build/rv32emu ./build/puzzle.elf
==62803== Memcheck, a memory error detector
==62803== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==62803== Using Valgrind-3.18.1 and LibVEX; rerun with -h for copyright info
==62803== Command: ./build/rv32emu ./build/puzzle.elf
==62803== 
==62803== Warning: set address range perms: large range [0x59c87000, 0x159c87000) (defined)
success in 2005 trials
inferior exit code 0
==62803== Warning: set address range perms: large range [0x59c87000, 0x159c87000) (noaccess)
==62803== 
==62803== HEAP SUMMARY:
==62803==     in use at exit: 39,124 bytes in 259 blocks
==62803==   total heap usage: 1,281 allocs, 1,022 frees, 20,306,989 bytes allocated
==62803== 
==62803== LEAK SUMMARY:
==62803==    definitely lost: 0 bytes in 0 blocks
==62803==    indirectly lost: 0 bytes in 0 blocks
==62803==      possibly lost: 0 bytes in 0 blocks
==62803==    still reachable: 37,108 bytes in 238 blocks
==62803==         suppressed: 0 bytes in 0 blocks
==62803== Rerun with --leak-check=full to see details of leaked memory
==62803== 
==62803== For lists of detected and suppressed errors, rerun with: -s
==62803== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

To address the issue, we can maintain a pool of rv_insn_t and take only the required number of entries from it. This ensures a heap allocation happens only when the pool runs out of rv_insn_t. The BLOCK_POOL_SIZE parameter gives us the flexibility to balance the number of calloc calls against memory usage.
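The pool idea can be sketched as a simple bump allocator (a simplification under assumed names; the actual PR additionally links the instructions of a block into a list, which this sketch omits):

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define BLOCK_POOL_SIZE 4096 /* tunable: fewer calloc calls vs. slack */

typedef struct rv_insn {
    uint32_t opcode; /* ... other decoded fields ... */
} rv_insn_t;

/* Hand out rv_insn_t slots from the current chunk; calloc a fresh
 * chunk only when the remaining space cannot satisfy the request. */
typedef struct {
    rv_insn_t *chunk;
    size_t used;
} insn_pool_t;

static rv_insn_t *pool_take(insn_pool_t *pool, size_t n)
{
    if (!pool->chunk || pool->used + n > BLOCK_POOL_SIZE) {
        pool->chunk = calloc(BLOCK_POOL_SIZE, sizeof(rv_insn_t));
        if (!pool->chunk)
            return NULL;
        pool->used = 0;
    }
    rv_insn_t *ir = pool->chunk + pool->used;
    pool->used += n;
    return ir;
}
```

A larger BLOCK_POOL_SIZE means fewer calloc calls but more unused tail space in the last chunk; a smaller value inverts that trade-off.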

This greatly improves heap memory allocation. As the following result shows, only 313,461 bytes are allocated for the puzzle.elf example.

$ valgrind ./build/rv32emu ./build/puzzle.elf
==62927== Memcheck, a memory error detector
==62927== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==62927== Using Valgrind-3.18.1 and LibVEX; rerun with -h for copyright info
==62927== Command: ./build/rv32emu ./build/puzzle.elf
==62927== 
==62927== Warning: set address range perms: large range [0x59c87000, 0x159c87000) (defined)
success in 2005 trials
inferior exit code 0
==62927== Warning: set address range perms: large range [0x59c87000, 0x159c87000) (noaccess)
==62927== 
==62927== HEAP SUMMARY:
==62927==     in use at exit: 39,124 bytes in 259 blocks
==62927==   total heap usage: 940 allocs, 681 frees, 313,461 bytes allocated
==62927== 
==62927== LEAK SUMMARY:
==62927==    definitely lost: 0 bytes in 0 blocks
==62927==    indirectly lost: 0 bytes in 0 blocks
==62927==      possibly lost: 0 bytes in 0 blocks
==62927==    still reachable: 37,108 bytes in 238 blocks
==62927==         suppressed: 0 bytes in 0 blocks
==62927== Rerun with --leak-check=full to see details of leaked memory
==62927== 
==62927== For lists of detected and suppressed errors, rerun with: -s
==62927== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

Because two consecutive instructions may now live in discontinuous memory, a drawback of this design is the cost of random access to an instruction. Since random access seems to be needed only for some fuse operations, I think this is not a big problem: we can still reach the instruction with a linear search via the GET_NEXT_N_INSN macro, albeit less efficiently. The discontinuous memory could also introduce cache locality issues, but we may be able to tune BLOCK_POOL_SIZE to trade this off.
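The linear search amounts to walking the per-block linked list of instructions n steps, roughly like this (a hypothetical re-creation of the GET_NEXT_N_INSN idea, not the actual macro):

```c
#include <stddef.h>

typedef struct rv_insn {
    struct rv_insn *next; /* successor within the same block */
    /* ... decoded fields ... */
} rv_insn_t;

/* O(n) walk replaces the O(1) pointer arithmetic that a contiguous
 * array allowed; acceptable because only some fuse operations need
 * random access to the n-th instruction. */
static rv_insn_t *get_next_n_insn(rv_insn_t *ir, size_t n)
{
    while (n-- && ir)
        ir = ir->next;
    return ir;
}
```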

Comment on lines +400 to +402
/* FIXME: This will simply find the n-th instruction by iterating
 * the linked list linearly; we may want to find a better approach. */

Check notice

Code scanning / CodeQL

FIXME comment

FIXME comment: This will simply find the n-th instruction by iterating [...]
@RinHizakura
Collaborator Author

This changes how the instructions are accessed, so careful testing to make sure everything still works correctly is important. But as #228 reports, we currently have some issues with the compliance test.

@jserv jserv requested a review from qwe661234 September 26, 2023 04:52
@qwe661234
Collaborator

I have implemented this without modifying a large amount of code in this branch. After merging branch wip/jit back to master, the interpreter's memory usage will be reduced.

@jserv
Contributor

jserv commented Sep 26, 2023

I have implemented this without modifying a large amount of code in this branch. After merging branch wip/jit back to master, the interpreter's memory usage will be reduced.

We should prioritize the implementation of the interpreter as the foundation for further optimizations, such as a memory pool. The wip/jit branch is for development purposes, and we should merge our efforts back into the master branch when possible.

@RinHizakura
Collaborator Author

RinHizakura commented Sep 26, 2023

I have implemented this without modifying a large amount of code in this branch. After merging branch wip/jit back to master, the interpreter's memory usage will be reduced.

A solution with fewer modifications would be preferable, but I doubt whether mpool can address this issue properly. As far as I know, mpool is useful for chunks of memory that are allocated and freed frequently. However, the rv_insn_t in each block persist after allocation until the emulator is closed. Also, it looks like mpool still allocates a fixed large chunk for each block, which still leaves unused memory in each block.

On the other hand, this PR tries to give each block "just enough" space. Although we can still over-allocate, by growing memory according to the number of rv_insn_t needed rather than per block_t, I believe this solution can save more space.

Of course, I may have to do more experiments to make sure I am correct.

@RinHizakura RinHizakura force-pushed the master branch 3 times, most recently from fe5c2a1 to babc35f Compare September 30, 2023 05:49
Contributor

@jserv jserv left a comment


Rebase onto the latest master branch and utilize the FORCE_INLINE macro.

@jserv jserv merged commit 0031224 into sysprog21:master Oct 2, 2023
vestata pushed a commit to vestata/rv32emu that referenced this pull request Jan 24, 2025
The original memory allocation strategy for instruction blocks was found
to be inefficient, leading to excessive memory usage. In the previous
approach, a fixed amount of memory was allocated for each block, resulting
in significant wastage.

To address this issue, we have implemented a more efficient memory
allocation scheme. Instead of allocating a fixed size for each block, we
now maintain a pool of rv_insn_t and allocate memory only when needed.
This new approach minimizes heap allocations and optimizes memory usage.

We have introduced a parameter, BLOCK_POOL_SIZE, which allows us to control
the balance between the number of calloc calls and memory consumption. This
flexibility ensures that memory allocation occurs only when the pool is
depleted.

As a result of these changes, the heap memory allocation has significantly
improved. For example, in the puzzle.elf example, we observed a reduction in
heap memory allocation from 20,306,989 bytes to just 313,461 bytes.

While this design may leave instructions that are adjacent in sequence in
discontinuous memory, the impact on random access is minimal, as random
access is primarily required for certain fuse operations. Where it is
needed, we can fall back to a linear search. The potential cache locality
issues resulting from the discontinuous memory can also be mitigated by
adjusting the BLOCK_POOL_SIZE parameter for better performance.