Introduce basic block #91

Merged


Risheng1128
Collaborator

This commit introduces basic blocks to the emulator, so that it decodes and executes a run of instructions at a time.

Completes the first requirement in #88.

TODO:

  1. Implement an efficient way to look up, in a hash map, the code block matching the current program counter, as wip/jit does;
  2. Reduce IR dispatch cost by means of computed goto.
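
The second TODO item can be illustrated with a minimal sketch. It assumes GCC or Clang, since labels-as-values is a compiler extension rather than standard C. The three-opcode IR, `ir_t`, and `run_ir` are invented for illustration and are not rv32emu's actual IR.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical three-opcode IR, invented for this sketch. */
enum { OP_ADDI, OP_SUBI, OP_HALT };

typedef struct {
    uint8_t opcode;
    int32_t imm;
} ir_t;

/* Run IR entries until OP_HALT and return the accumulator. */
static int32_t run_ir(const ir_t *ir, int32_t acc)
{
    /* One label address per opcode, indexed by opcode value
     * (GNU C "labels as values" extension). */
    static const void *const dispatch[] = {
        [OP_ADDI] = &&do_addi,
        [OP_SUBI] = &&do_subi,
        [OP_HALT] = &&do_halt,
    };

    assert(ir != NULL);

/* Jump straight to the handler of the current entry: no central switch. */
#define DISPATCH() goto *dispatch[ir->opcode]

    DISPATCH();
do_addi:
    acc += ir->imm;
    ir++;
    DISPATCH();
do_subi:
    acc -= ir->imm;
    ir++;
    DISPATCH();
do_halt:
#undef DISPATCH
    return acc;
}
```

Each handler ends with its own indirect jump, so there is no shared loop head and `switch` to return to after every instruction, which is where the dispatch saving is expected to come from.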

@jserv
Contributor

jserv commented Nov 13, 2022

I merged commit 2aa7154; let's concentrate on the basic block part.

@Risheng1128 Risheng1128 force-pushed the wip/instruction-decode branch from 569c378 to 7fe0c6a Compare November 14, 2022 23:27
@jserv
Contributor

jserv commented Nov 15, 2022

Instead of decoding and interpreting individual instructions in a loop, as an interpreting emulator does, an attempt is made to combine entire blocks, which usually end with a branch/jump instruction. With a JIT framework such as MIR, native instructions corresponding to a block are generated and executed the first time its memory address is jumped to.

Jumps are not executed directly, but instead the jump target is saved and the generated function is exited with RET. This allows the runtime environment to first compile the block at the jump target and perform other parallel tasks, including interrupt, input, and peripheral emulation.
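
A minimal sketch of that control flow, assuming a plain C function stands in for JIT-generated code. `cpu_t`, `block_fn`, `get_block`, and `run` are hypothetical names, and the single block at 0x100 is made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical CPU state for this sketch. */
typedef struct {
    uint32_t pc; /* address of the block to run next */
    uint32_t x1; /* one register, used as a loop counter */
} cpu_t;

/* A translated block: runs its body, then returns the jump target
 * instead of jumping there itself. */
typedef uint32_t (*block_fn)(cpu_t *);

/* Block at 0x100: decrement x1, loop while x1 != 0, else go to 0x200. */
static uint32_t block_0x100(cpu_t *cpu)
{
    cpu->x1--;
    return cpu->x1 ? 0x100 : 0x200;
}

/* Stand-in for "compile the block at this address on first use". */
static block_fn get_block(uint32_t pc)
{
    return pc == 0x100 ? block_0x100 : NULL;
}

/* Runtime loop: returns how often control came back to the runtime. */
static unsigned run(cpu_t *cpu)
{
    unsigned exits = 0;

    assert(cpu != NULL);
    for (;;) {
        block_fn fn = get_block(cpu->pc);
        if (!fn)
            break;         /* no block at this address: stop the sketch */
        cpu->pc = fn(cpu); /* the block returned instead of jumping */
        exits++;           /* interrupts and peripherals could be serviced here */
    }
    return exits;
}
```

Because every block exit passes through `run`, the runtime gets a regular opportunity to translate the next block or emulate peripherals before guest execution continues.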

During the compilation of a program block, the number of RISC-V clock cycles required up to this point is calculated for each possible end over which the block can be exited, and this sum is added to an instruction counter during execution. By means of this counter, events that occur on the RISC-V at certain times, such as system timer, can be precisely timed despite the higher speed of the host platform. Since there may be routines in the emulated programs that are dependent on a fixed number of executed instructions in a certain period of time, the timers of the host system cannot be used without compatibility problems. Due to the block-wise execution, however, there is also the problem with the emulator presented here that interrupts or timers are only executed or updated a few clock pulses late - after the next jump.

@Risheng1128, please confirm that queue-based block management meets the requirements outlined above. In particular, when an exception/interrupt occurs, the proposed emulator must be able to jump to the specified block.

@Risheng1128 Risheng1128 force-pushed the wip/instruction-decode branch from 7fe0c6a to b7f09eb Compare November 19, 2022 23:39
@Risheng1128
Collaborator Author

Risheng1128 commented Nov 19, 2022

This revision uses queue-based block management for instruction decode and execution.

In the decode stage, the emulator allocates a new memory block, puts each decoded instruction into it, and records the order of instructions with the member front in struct block_t, until the queue is full or the latest instruction is a branch instruction. The default size of a memory block in the queue is 1024.

In the execution stage, while the queue is not empty, the emulator executes the memory block whose index is rear in struct block_t, frees the memory block, and finally increments the clock cycle.

In particular, when an exception/interrupt occurs, the emulator takes the following steps:

  1. Execute the exception/interrupt handler, which sets a new program counter from the register mtvec; the function emulate returns false.
  2. Enter the decode stage again and create new memory blocks based on the new program counter.
    That is, the emulator discards the old memory blocks and creates new ones from the new program counter.

@jserv
Contributor

jserv commented Nov 20, 2022

TODO: use perf to compare the performance hotspots of the master branch (the "classic" one) and the proposed changes. We should understand the causes of the slowdowns.

@Risheng1128 Risheng1128 force-pushed the wip/instruction-decode branch from b7f09eb to cae22ec Compare November 20, 2022 06:08
@Risheng1128
Collaborator Author

Let the emulator execute coremark.elf and use perf to observe the hotspots of the master and wip/instruction-decode branches, respectively.

branch master:

Overhead    Command    Shared Object   Symbol
52.99%      rv32emu    rv32emu         [.] rv_step
9.80%       rv32emu    rv32emu         [.] memory_ifetch
8.69%       rv32emu    rv32emu         [.] on_mem_ifetch
3.88%       rv32emu    rv32emu         [.] rv_userdata
2.98%       rv32emu    rv32emu         [.] memory_write
1.98%       rv32emu    rv32emu         [.] main
1.40%       rv32emu    rv32emu         [.] memory_read_w
0.82%       rv32emu    rv32emu         [.] memory_read_s
0.73%       rv32emu    rv32emu         [.] on_mem_write_w
0.72%       rv32emu    rv32emu         [.] rv_has_halted
0.55%       rv32emu    rv32emu         [.] on_mem_read_w
0.48%       rv32emu    rv32emu         [.] on_mem_read_s
...

branch wip/instruction-decode:

Overhead    Command    Shared Object   Symbol
39.95%      rv32emu    rv32emu         [.] rv_step
10.52%      rv32emu    rv32emu         [.] rv_decode
6.59%       rv32emu    rv32emu         [.] memory_ifetch
6.09%       rv32emu    rv32emu         [.] op_op_imm
5.71%       rv32emu    rv32emu         [.] on_mem_ifetch
3.85%       rv32emu    rv32emu         [.] op_branch
2.35%       rv32emu    rv32emu         [.] rv_userdata
2.29%       rv32emu    rv32emu         [.] op_op
2.27%       rv32emu    rv32emu         [.] op_load
1.99%       rv32emu    rv32emu         [.] memory_write
0.99%       rv32emu    rv32emu         [.] memory_read_w
0.92%       rv32emu    rv32emu         [.] op_store
0.51%       rv32emu    rv32emu         [.] memory_read_s
0.51%       rv32emu    rv32emu         [.] on_mem_write_w
0.50%       rv32emu    rv32emu         [.] on_mem_read_w
...

@Risheng1128 Risheng1128 force-pushed the wip/instruction-decode branch from cae22ec to fcce814 Compare November 20, 2022 06:55
@jserv
Contributor

jserv commented Nov 20, 2022

Let the emulator execute coremark.elf and use perf to observe the hotspots of the master and wip/instruction-decode branches, respectively.

If a block has been translated, we should not throw it away after emulating the instructions within; instead, we should save it for future lookups.

wip/instruction-decode is indeed spending more time on decoding + emulating than the master branch. This results from inefficient block translation, which prevents blocks from being reused.

@jserv
Contributor

jserv commented Nov 20, 2022

wip/instruction-decode is indeed spending more time on decoding + emulating than the master branch. This results from inefficient block translation, which prevents blocks from being reused.

Check wip/jit and refactor-rv32 for block prediction. We should attempt to predict the next block, which is known to boost performance.

@LambertWSJ
Collaborator

I recommend using a hash table and block prediction. Besides giving quick access to basic blocks, this may also make the control flow graph more convenient to implement.

@Risheng1128 Risheng1128 force-pushed the wip/instruction-decode branch from fcce814 to 177d998 Compare November 25, 2022 19:08
@lgtm-com

lgtm-com bot commented Nov 25, 2022

This pull request introduces 1 alert and fixes 1 when merging 177d998 into 2aa7154 - view on LGTM.com

new alerts:

  • 1 for FIXME comment

fixed alerts:

  • 1 for FIXME comment

Heads-up: LGTM.com's PR analysis will be disabled on the 5th of December, and LGTM.com will be shut down ⏻ completely on the 16th of December 2022. Please enable GitHub code scanning, which uses the same CodeQL engine ⚙️ that powers LGTM.com. For more information, please check out our post on the GitHub blog.

@Risheng1128
Collaborator Author

Risheng1128 commented Nov 25, 2022

Removed the queue from the block: we only need to know the number of instructions it encompasses, and the first instruction to execute is always at the first position of the IR array.

Blocks are now managed with a hash table and block prediction, which avoids decoding the same block twice and finds the next block efficiently.

Finally, performance was measured with coremark.elf and dhrystone.elf.
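
A minimal sketch of how a hash table and block prediction can work together. `block_map_clear` and `insn_num` are names that appear in this PR, but their use here, the signatures, the open-addressing table, `block_get`, and the `predict` pointer are assumptions made for illustration, not the merged code.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define MAP_CAPACITY 64u /* power of two; hypothetical size */

typedef struct block {
    uint32_t pc_start;     /* address of the first instruction */
    uint32_t insn_num;     /* number of instructions in the block */
    struct block *predict; /* block that followed this one last time */
} block_t;

typedef struct {
    block_t *slots[MAP_CAPACITY];
    unsigned decoded; /* how many blocks had to be created */
} block_map_t;

static uint32_t hash_pc(uint32_t pc)
{
    return (pc >> 2) & (MAP_CAPACITY - 1);
}

/* Linear probing: returns the slot holding pc, or the first empty slot
 * on its probe path, or NULL when the table is full. */
static block_t **find_slot(block_map_t *map, uint32_t pc)
{
    uint32_t i = hash_pc(pc);
    for (uint32_t n = 0; n < MAP_CAPACITY; n++) {
        block_t **slot = &map->slots[i];
        if (!*slot || (*slot)->pc_start == pc)
            return slot;
        i = (i + 1) & (MAP_CAPACITY - 1);
    }
    return NULL;
}

/* Return the block starting at pc, creating it only on a miss. */
static block_t *block_get(block_map_t *map, block_t *prev, uint32_t pc)
{
    /* Fast path: the previous block's prediction is correct. */
    if (prev && prev->predict && prev->predict->pc_start == pc)
        return prev->predict;

    block_t **slot = find_slot(map, pc);
    assert(slot != NULL); /* the sketch neither grows nor evicts */
    if (!*slot) {
        *slot = calloc(1, sizeof(block_t));
        assert(*slot != NULL);
        (*slot)->pc_start = pc;
        map->decoded++; /* instruction decoding would happen here */
    }
    if (prev)
        prev->predict = *slot; /* remember the successor for next time */
    return *slot;
}

/* Free every cached block and reset the map. */
static void block_map_clear(block_map_t *map)
{
    for (uint32_t i = 0; i < MAP_CAPACITY; i++) {
        free(map->slots[i]);
        map->slots[i] = NULL;
    }
    map->decoded = 0;
}
```

In a loop that alternates between two blocks, each block is decoded once and later transitions take the prediction fast path without touching the table.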

old:

coremark
---------------------------------------------------------------------
2K performance run parameters for coremark.
CoreMark Size    : 666
Total ticks      : 13939747
Total time (secs): 13.939747
Iterations/Sec   : 430.423881
Iterations       : 6000
Compiler version : GCC11.1.0
Compiler flags   : -O2 -DPERFORMANCE_RUN=1  
Memory location  : Please put data memory location here
			(e.g. code in flash, data on heap etc)
seedcrc          : 0xe9f5
[0]crclist       : 0xe714
[0]crcmatrix     : 0x1fd7
[0]crcstate      : 0x8e3a
[0]crcfinal      : 0xa14c
Correct operation validated. See README.md for run and reporting rules.
CoreMark 1.0 : 430.423881 / GCC11.1.0 -O2 -DPERFORMANCE_RUN=1   / Heap

dhrystone
---------------------------------------------------------------------
Dhrystone(1.1-mc), 10000000 passes, 13554164 microseconds, 419 DMIPS

new:

coremark
---------------------------------------------------------------------
2K performance run parameters for coremark.
CoreMark Size    : 666
Total ticks      : 12069925
Total time (secs): 12.069925
Iterations/Sec   : 911.356119
Iterations       : 11000
Compiler version : GCC11.1.0
Compiler flags   : -O2 -DPERFORMANCE_RUN=1  
Memory location  : Please put data memory location here
			(e.g. code in flash, data on heap etc)
seedcrc          : 0xe9f5
[0]crclist       : 0xe714
[0]crcmatrix     : 0x1fd7
[0]crcstate      : 0x8e3a
[0]crcfinal      : 0x33ff

dhrystone
---------------------------------------------------------------------
Dhrystone(1.1-mc), 10000000 passes, 5904499 microseconds, 961 DMIPS

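The new implementation is about 2.1 times faster than the old one on CoreMark (911 vs. 430 iterations/sec) and about 2.3 times faster on Dhrystone (961 vs. 419 DMIPS).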

@jserv jserv requested a review from RinHizakura November 25, 2022 22:37
@Risheng1128 Risheng1128 force-pushed the wip/instruction-decode branch from 177d998 to da35767 Compare November 26, 2022 16:12
@lgtm-com

lgtm-com bot commented Nov 26, 2022

This pull request introduces 1 alert and fixes 1 when merging da35767 into 2aa7154 - view on LGTM.com

new alerts:

  • 1 for FIXME comment

fixed alerts:

  • 1 for FIXME comment


@Risheng1128
Collaborator Author

Only block_map_clear is retained in decode.h, since it is called from emulate.c and riscv.c.

@Risheng1128 Risheng1128 force-pushed the wip/instruction-decode branch from da35767 to 76a028a Compare November 27, 2022 08:38
This commit introduces basic blocks to the emulator, so that it
decodes and executes a run of instructions at a time.

Use a hash table and block prediction to manage blocks efficiently.

In the decode stage, allocate a new block, which holds up to 1024
instructions by default, decode instructions into the block until it
is full or the latest instruction is a branch instruction, and put
the block into the block map.

In the execution stage, the emulator executes the instructions in a
block. The number of instructions is given by the member insn_num in
struct block.

In particular, when an exception/interrupt occurs, the emulator
takes the following steps:
1. Execute the exception/interrupt handler, which sets a new program
   counter from the register mtvec; the function emulate returns false.
2. Enter the decode stage again and create a new block based on
   the new program counter.
That is, the emulator stops executing the old block and creates a new
one from the new program counter.

In addition, the file decode.c includes the header file
riscv_private.h, which includes the gdbstub file. This breaks the
build because gdbstub is not cloned until emulate.c is compiled.
To resolve this problem, swap the build order of emulate.o and
decode.o.
@Risheng1128 Risheng1128 force-pushed the wip/instruction-decode branch from 76a028a to 856a3e9 Compare November 27, 2022 13:31
@jserv jserv merged commit bb63d93 into sysprog21:wip/instruction-decode Nov 28, 2022
vestata pushed a commit to vestata/rv32emu that referenced this pull request Jan 24, 2025