Reduce memory usage for instruction block #232
Conversation
This changes the method of accessing the instructions, so carefully testing that everything still behaves correctly is important. But as #228 reports, we now have some issues with the compliance test.
I have implemented this without modifying a large amount of code in this branch. After merging branch
We should prioritize the implementation of the interpreter as the foundation for further optimizations, such as a memory pool. The
It would be better if there were a solution with fewer modifications, but I suspect that whether On the other hand, this PR tries to give "just enough" space for each block. Although we can still allocate too much memory, by expanding the memory according to the need of Of course, I may have to do more experiments to make sure I am correct.
Force-pushed from fe5c2a1 to babc35f
Rebase the latest `master` branch and utilize the `FORCE_INLINE` macro.
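For context, a `FORCE_INLINE` macro typically wraps a compiler-specific always-inline attribute so that hot-path helpers are inlined regardless of optimization level. The definition below is only a sketch (the project's actual macro and the `sign_extend_imm` helper shown here may differ):

```c
#include <stdint.h>

/* Sketch of a FORCE_INLINE macro; the project's actual definition may differ. */
#if defined(__GNUC__) || defined(__clang__)
#define FORCE_INLINE __attribute__((always_inline)) inline
#elif defined(_MSC_VER)
#define FORCE_INLINE __forceinline
#else
#define FORCE_INLINE inline
#endif

/* Hypothetical hot-path helper we always want inlined into the dispatch
 * loop: extract the sign-extended I-type immediate from bits 31:20. */
FORCE_INLINE static uint32_t sign_extend_imm(uint32_t insn)
{
    return (uint32_t) ((int32_t) insn >> 20);
}
```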
The original strategy for allocating instruction blocks wastes memory. For every single block, we always create a space that can contain (1 << 10) `rv_insn_t`, yet most blocks hold far fewer instructions. We can inspect the heap usage with Valgrind: with the old design, 20,306,989 bytes are allocated on a run of `puzzle.elf`.

To address the issue, we can simply maintain a pool of `rv_insn_t` and take only the required number of `rv_insn_t` entries from it. This ensures a heap allocation happens only when the pool runs out of `rv_insn_t`. The parameter `BLOCK_POOL_SIZE` gives us the flexibility to balance the number of `calloc` calls against memory usage.

This yields a great improvement in heap memory allocation: only 313,461 bytes are allocated for the `puzzle.elf` example.

Because two instructions in sequence may now end up in discontinuous memory spaces, a drawback of this design is the cost of random access to an instruction. Since random access seems to be needed only for some fuse operations, I think this might not be a big problem: we can still search for the instruction linearly with the `GET_NEXT_N_INSN` macro, albeit in a relatively inefficient manner. The design could also introduce some cache locality issues due to the discontinuous memory spaces, but we may be able to adjust `BLOCK_POOL_SIZE` to trade off for this.
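The scheme described above can be sketched roughly as follows. Note this is only an illustration of the idea, not the PR's actual code: the struct layout, the `insn_alloc` helper, and the function standing in for `GET_NEXT_N_INSN` are all hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical instruction record standing in for rv_insn_t. */
typedef struct rv_insn {
    uint32_t opcode;
    struct rv_insn *next; /* next instruction in the same block */
} rv_insn_t;

/* Entries obtained per calloc call; tuning this trades fewer calloc
 * calls against higher peak memory usage (and cache locality). */
#define BLOCK_POOL_SIZE 128

static rv_insn_t *pool;  /* current chunk of pooled entries */
static size_t pool_used; /* entries already handed out from the chunk */

/* Take one zero-initialized rv_insn_t from the pool; calloc a fresh
 * chunk only when the current one is depleted. Exhausted chunks are
 * intentionally not tracked here; a real implementation would keep
 * them for cleanup. */
static rv_insn_t *insn_alloc(void)
{
    if (!pool || pool_used == BLOCK_POOL_SIZE) {
        pool = calloc(BLOCK_POOL_SIZE, sizeof(rv_insn_t));
        if (!pool)
            return NULL;
        pool_used = 0;
    }
    return &pool[pool_used++];
}

/* Entries taken from different chunks are not contiguous, so random
 * access degrades to a linear walk over the next pointers, which is
 * the trade-off GET_NEXT_N_INSN accepts. */
static rv_insn_t *get_next_n_insn(rv_insn_t *ir, size_t n)
{
    while (ir && n--)
        ir = ir->next;
    return ir;
}
```

Because blocks only consume what they use, the per-block overhead drops from a fixed (1 << 10) entries to at most one partially used chunk, which is where the reported reduction in heap usage comes from.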