Introducing local register allocation for the tier-1 JIT compiler #341
Before we review this pull request, please provide a detailed explanation of the register allocation (RA) algorithm you've implemented, including the rationale behind your choice. It would also help to discuss how this approach compares with traditional algorithms such as graph coloring and linear scan, highlighting the advantages and trade-offs of each. Reference:
I implemented register allocation using an available host register table, a counter, and a VM register table.
Take the instruction sequence below as an example:
In the first In the second |
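A rough sketch of how the three pieces mentioned above (host register table, counter, VM register table) might fit together. This is not the code from this pull request: `ra_t`, `ra_map`, `ra_flush`, and `N_HOST` are hypothetical names, the counter is assumed to serve as a least-recently-used timestamp for picking an eviction victim, and every mapped value is treated as dirty. Loads and stores are only counted, not emitted.

```c
#include <assert.h>
#include <string.h>

enum { N_HOST = 3, N_VM = 32 }; /* N_HOST: hypothetical host register count */

typedef struct {
    int host_to_vm[N_HOST];    /* host register table: VM register held, -1 if free */
    int vm_to_host[N_VM];      /* VM register table: host register assigned, -1 if none */
    unsigned last_use[N_HOST]; /* counter value when each host register was last used */
    unsigned counter;          /* bumped on every mapping request (assumed LRU clock) */
    int loads, stores;         /* stand-ins for emitted load/store instructions */
} ra_t;

static void ra_reset(ra_t *ra)
{
    memset(ra, 0, sizeof(*ra));
    for (int i = 0; i < N_HOST; i++)
        ra->host_to_vm[i] = -1;
    for (int i = 0; i < N_VM; i++)
        ra->vm_to_host[i] = -1;
}

/* Return the host register holding vm_reg, loading or evicting as needed. */
static int ra_map(ra_t *ra, int vm_reg)
{
    assert(vm_reg >= 0 && vm_reg < N_VM);
    ra->counter++;
    int h = ra->vm_to_host[vm_reg];
    if (h >= 0) { /* hit: the value is already in a host register */
        ra->last_use[h] = ra->counter;
        return h;
    }
    int victim = 0;
    for (int i = 0; i < N_HOST; i++) {
        if (ra->host_to_vm[i] < 0) { /* a free host register always wins */
            victim = i;
            break;
        }
        if (ra->last_use[i] < ra->last_use[victim])
            victim = i; /* otherwise remember the least recently used one */
    }
    if (ra->host_to_vm[victim] >= 0) { /* table full: write the old value back */
        ra->vm_to_host[ra->host_to_vm[victim]] = -1;
        ra->stores++;
    }
    ra->host_to_vm[victim] = vm_reg;
    ra->vm_to_host[vm_reg] = victim;
    ra->last_use[victim] = ra->counter;
    ra->loads++; /* bring the VM register value into the host register */
    return victim;
}

/* End of basic block: write every live mapping back and clear both tables. */
static void ra_flush(ra_t *ra)
{
    for (int i = 0; i < N_HOST; i++) {
        if (ra->host_to_vm[i] < 0)
            continue;
        ra->vm_to_host[ra->host_to_vm[i]] = -1;
        ra->host_to_vm[i] = -1;
        ra->stores++;
    }
}
```

In this model, mapping the same VM register twice in a row costs one load, and a store happens only on eviction or when `ra_flush` runs at the end of the block.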
Rebase onto the latest master branch for API refinement.
Provide benchmarks for both x86-64 and Aarch64.
    if (reg_table[i] == target_reg) {
        reg_table[i] = -1;
        emit_store(state, S32, target_reg, parameter_reg[0],
                   offsetof(riscv_t, X) + 4 * i);
Can we do an early return here? The target should be unique in the reg_table.
Yes, we should do an early return here.
Added.
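For reference, the early return discussed above could look roughly like the sketch below. It assumes, as the review comment states, that a host register appears at most once in `reg_table`. The function name `store_back_target` and the stub `emit_store_stub` are hypothetical stand-ins; the real `emit_store` takes more arguments, as the excerpt shows.

```c
#include <assert.h>

enum { N_VM_REGS = 32 };

static int stores_emitted; /* counts stub calls instead of emitting machine code */
static int last_offset;    /* byte offset passed to the most recent stub call */

/* Hypothetical stub standing in for the real emit_store(). */
static void emit_store_stub(int host_reg, int offset)
{
    (void) host_reg;
    last_offset = offset;
    stores_emitted++;
}

/* Write back the VM register cached in target_reg, if any.
 * Returns the number of table slots examined.
 */
static int store_back_target(int reg_table[N_VM_REGS], int target_reg)
{
    assert(target_reg >= 0); /* -1 marks an empty slot, so it is not a valid target */
    for (int i = 0; i < N_VM_REGS; i++) {
        if (reg_table[i] == target_reg) {
            reg_table[i] = -1;
            emit_store_stub(target_reg, 4 * i);
            return i + 1; /* early return: target_reg is unique in the table */
        }
    }
    return N_VM_REGS; /* target_reg was not mapped to any VM register */
}
```

Without the early return, the loop would keep scanning the remaining slots even though no second match can exist.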
Refine the git commit message with a correct column description --
Sure, I explained it in the git commit message.
In the realm of compiler optimizations, register allocation strategies are broadly categorized into local, global, and interprocedural register allocation. The proposed change specifically pertains to local register allocation. This approach tracks register contents within a basic block -- a linear sequence of instructions -- so that variables and constants can be reused directly from registers. This should be stated clearly in both the pull request description and the git commit messages so that the scope and nature of the register allocation are unambiguous. I defer to @vacantron for confirmation.
Local register allocation reuses host register values within the scope of a basic block, thereby reducing the number of load and store instructions.

Take consecutive addi instructions as an example:

    addi t0, t0, 1
    addi t0, t0, 1
    addi t0, t0, 1

* The generated machine code without register allocation

    load t0, t0_addr
    add  t0, 1
    sw   t0, t0_addr
    load t0, t0_addr
    add  t0, 1
    sw   t0, t0_addr
    load t0, t0_addr
    add  t0, 1
    sw   t0, t0_addr

* The generated machine code with register allocation

    load t0, t0_addr
    add  t0, 1
    add  t0, 1
    add  t0, 1
    sw   t0, t0_addr

As shown in the above example, register allocation reuses the host register and reduces the number of load and store instructions.

* x86-64 (i7-11700)

| Metric    | W/O RA   | W/ RA    | SpeedUp |
|-----------+----------+----------+---------|
| dhrystone |  0.342 s |  0.328 s |  +4.27% |
| miniz     |  1.243 s |  1.185 s |  +4.89% |
| primes    |  1.716 s |  1.689 s |  +1.60% |
| sha512    |  2.063 s |  1.880 s |  +9.73% |
| stream    | 11.619 s | 11.419 s |  +1.75% |

* Aarch64 (eMag)

| Metric    | W/O RA   | W/ RA    | SpeedUp |
|-----------+----------+----------+---------|
| dhrystone |  1.935 s |  1.301 s | +48.73% |
| miniz     |  7.706 s |  4.362 s | +76.66% |
| primes    | 10.513 s |  9.633 s |  +9.14% |
| sha512    |  6.508 s |  6.119 s |  +6.36% |
| stream    | 45.174 s | 38.037 s | +18.76% |

As demonstrated in the performance analysis, register allocation improves the overall performance of the machine code generated by T1C. Without RA, the generated machine code needs to store the register value back at the end of every instruction. With RA, the register value only needs to be stored back at the end of a basic block or when the host registers are fully occupied. The improvement is particularly pronounced on Aarch64 because it has more registers available, which allows more VM registers to stay mapped at once.
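The load/store counts in the addi example can be reproduced with a small model. This is only an illustration with hypothetical names (`mem_ops_t`, `count_without_ra`, `count_with_ra`). It assumes every instruction has the form `addi rd, rd, imm`, so each one is identified by its single register index, and that the block never needs more host registers than are available.

```c
#include <assert.h>

typedef struct {
    int loads;
    int stores;
} mem_ops_t;

/* Without RA: each instruction loads its register and stores it back. */
static mem_ops_t count_without_ra(int n_insn)
{
    mem_ops_t ops = {n_insn, n_insn};
    return ops;
}

/* With local RA: load a register on first use in the block,
 * store each used register once at the end of the block.
 */
static mem_ops_t count_with_ra(const int *regs, int n_insn)
{
    int cached[32] = {0}; /* 1 if the VM register already lives in a host register */
    mem_ops_t ops = {0, 0};
    for (int i = 0; i < n_insn; i++) {
        assert(regs[i] >= 0 && regs[i] < 32);
        if (!cached[regs[i]]) {
            cached[regs[i]] = 1;
            ops.loads++;
        }
    }
    for (int r = 0; r < 32; r++)
        ops.stores += cached[r]; /* one write-back per cached register */
    return ops;
}
```

For three consecutive instructions on `t0` (register index 5), the model gives 3 loads and 3 stores without RA, and 1 load and 1 store with RA, matching the two listings above.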
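On the column description: the SpeedUp values in the tables are consistent with the time without RA divided by the time with RA, minus one. That formula is inferred from the numbers rather than stated in the commit message, and the helper name `speedup_percent` is hypothetical.

```c
#include <assert.h>

/* Relative speedup in percent, assuming
 * SpeedUp = (t_without / t_with - 1) * 100.
 */
static double speedup_percent(double t_without, double t_with)
{
    assert(t_with > 0.0);
    return (t_without / t_with - 1.0) * 100.0;
}
```

For example, `speedup_percent(0.342, 0.328)` evaluates to roughly 4.27, which matches the dhrystone row for x86-64.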