Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve JALR execution with JIT-cache #471

Merged
merged 2 commits into from
Aug 7, 2024
Merged

Conversation

vacantron
Copy link
Collaborator

@vacantron vacantron commented Aug 1, 2024

Currently, the "JALR" indirect jump instruction turns the mode of rv32emu from T2C back to the interpreter. This commit introduces a "JIT-cache" table lookup to make it redirect to the JIT-ed code entry and avoids the mode change.

There are several scenarios benefitting from this approach, e.g. function pointer invocation and far-way function call. The former like "qsort" can be speeded up two times, and the latter like "Fibonacci", which compiled from the hand-written assembly, can even reach x4.3 performance enhencement.

perf-indirect-jump-improvement

Copy link
Contributor

@jserv jserv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Benchmarks

Benchmark suite Current: f5d04fb Previous: 2b492f4 Ratio
Dhrystone 1602.55 Average DMIPS over 10 runs 1575.66 Average DMIPS over 10 runs 0.98
Coremark 1407.431 Average iterations/sec over 10 runs 1413.569 Average iterations/sec over 10 runs 1.00

This comment was automatically generated by workflow using github-action-benchmark.

Copy link
Contributor

@jserv jserv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Place the source code of fibonacci.elf in directory tests. Always use lowercase filenames.

src/jit-cache.h Outdated Show resolved Hide resolved
@qwe661234
Copy link
Collaborator

qwe661234 commented Aug 2, 2024

Based on my previous research, searching any code cache is not suit for our framework because we do not translate all path into T1C or T2C. Therefore, you will get performance loss due to frequent cache miss. Maybe you can try to use the same way in T1C, that is parsing branch history table and including the predicted path into execution path for T2C.

@vacantron
Copy link
Collaborator Author

vacantron commented Aug 2, 2024

Based on my previous research, searching any code cache is not suit for our framework because we do not translate all path into T1C or T2C. Therefore, you will get performance loss due to frequent cache miss.

Yes, the cache-miss might cancel the improvement due to the cache searching overhead, so the implementation of searching cache is important here.

I have tried several implementations like searching in static array, searching in hash table, etc., and I found the hardware-like caching is the most friendly approach for our framework, which its overhead can be almost disregarded.

I have updated the main performance analysis picture in the top to prove the above statement.

@qwe661234
Copy link
Collaborator

qwe661234 commented Aug 2, 2024

the cache-miss might cancel the improvement due to the cache searching overhead, so the implementation of searching cache is important here.

I have tried several implementations like searching in static array, searching in hash table, etc., and I found the hardware-like caching is the most friendly approach for our framework, which its overhead can be almost disregarded.

I have updated the main performance analysis picture in the top to prove the above statement.

I see, your strategy significantly improves the performance of benchmark with large number of jalr instruction, such as qsort and fib. However, it has negative impact on other types of benchmarks like idea, and this result is the same as my research.

@vacantron
Copy link
Collaborator Author

vacantron commented Aug 2, 2024

However, it has negative impact on other types of benchmarks like idea, and this result is the same as my research.

No, the performance degradation of IDEA comes from the invocation of JIT-generated code. Since the second bar (orange) shows the overhead of all cache-miss, the last bar (green) only shows the performance variety of the JIT-ed code invocation, which the increment/decrement of it is decided by the quality of generated code.

src/jit.h Outdated Show resolved Hide resolved
src/jit.h Outdated Show resolved Hide resolved
src/jit.h Outdated Show resolved Hide resolved
src/jit.h Outdated Show resolved Hide resolved
@jserv jserv requested a review from qwe661234 August 3, 2024 03:00
@qwe661234
Copy link
Collaborator

However, it has negative impact on other types of benchmarks like idea, and this result is the same as my research.

No, the performance degradation of IDEA comes from the invocation of JIT-generated code. Since the second bar (orange) shows the overhead of all cache-miss, the last bar (green) only shows the performance variety of the JIT-ed code invocation, which the increment/decrement of it is decided by the quality of generated code.

Could you further explain why the performance of all cache misses is better than your improvement? The JIT-ed code stored in the code cache for call is T1C, so the quality is bad?

@jserv
Copy link
Contributor

jserv commented Aug 3, 2024

CI failure:

git submodule update --init tests/
Submodule 'riscv-arch-test' (https://github.com/riscv-non-isa/riscv-arch-test) registered for path 'tests/riscv-arch-test'
Cloning into '/home/runner/work/rv32emu/rv32emu/tests/riscv-arch-test'...
From https://github.com/riscv-non-isa/riscv-arch-test
 * branch              ed32d6767e5c35fa98dfc5aa91d7f4b199f8c639 -> FETCH_HEAD
Submodule path 'tests/riscv-arch-test': checked out 'ed32d6767e5c35fa98dfc5aa91d7f4b199f8c639'
Traceback (most recent call last):
  File "/home/runner/.local/bin/riscof", line 5, in <module>
    from riscof.cli import cli
  File "/home/runner/.local/lib/python3.10/site-packages/riscof/cli.py", line 18, in <module>
    import riscof.framework.main as framework
  File "/home/runner/.local/lib/python3.10/site-packages/riscof/framework/main.py", line 11, in <module>
    from riscv_isac.isac import preprocessing
ImportError: cannot import name 'preprocessing' from 'riscv_isac.isac' (/home/runner/.local/lib/python3.10/site-packages/riscv_isac/isac.py)
make: *** [mk/riscv-arch-test.mk:18: arch-test] Error 1

@vacantron
Copy link
Collaborator Author

CI failure

This seems to be the upstream issue (riscv-software-src/riscof#122) .

@jserv
Copy link
Contributor

jserv commented Aug 3, 2024

riscv-software-src/riscof#122

A temporary workaround:

pip3 install git+https://github.com/riscv/riscof.git@d38859f85fe407bcacddd2efcd355ada4683aee4

@jserv
Copy link
Contributor

jserv commented Aug 3, 2024

It is a bit confusing to have both block cache and jit cache in the same file. Can you clarify?

@vacantron
Copy link
Collaborator Author

vacantron commented Aug 4, 2024

Could you further explain why the performance of all cache misses is better than your improvement? The JIT-ed code stored in the code cache for call is T1C, so the quality is bad?

After researching, the performance degradation occurred when the T2C executed the T1C cache directly and bypassed the profiler.

In the previous implementation, I added all entries of the T1C-generated code to the cache (and even the part from block-chaining), but the chained blocks might be generated by parsing the branch history table and not be the main control flow. If that cache is used by the T2C, the profiler will have no chance to execute again to tag the right potential hotspot, and I think this is the main reason of the performance degradation of IDEA .

After constraining the source of cache coming from the T1C, the performance is shown at the top and it becomes more consistent with expectations.

tests/fibonacci.s Outdated Show resolved Hide resolved
src/t2c.c Outdated Show resolved Hide resolved
src/t2c.c Outdated Show resolved Hide resolved
src/jit.h Show resolved Hide resolved
src/jit.h Outdated Show resolved Hide resolved
@jserv
Copy link
Contributor

jserv commented Aug 5, 2024

Does JIT cache make sense only to T2C enabled builds? If so, it should be part of T2C component if we found no regressions.

@vacantron
Copy link
Collaborator Author

Does JIT cache make sense only to T2C enabled builds? If so, it should be part of T2C component if we found no regressions.

Yes, but there is no header file for the declarations of the T2C implementations, so the related type definitions and code are placed in jit.h and t2c.c now. Do we split them into t2c.h ?

@jserv
Copy link
Contributor

jserv commented Aug 6, 2024

there is no header file for the declarations of the T2C implementations, so the related type definitions and code are placed in jit.h and t2c.c now. Do we split them into t2c.h ?

We can consider header refactoring when integrating with frameworks like (PHP) IR or similar compiler frameworks. However, that is not necessary at this moment. Let's keep changes to a minimum.

Copy link
Contributor

@jserv jserv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rework the git commits by merging the intermediate changes.

@jserv jserv added this to the release-2024.1 milestone Aug 7, 2024
Currently, the "JALR" indirect jump instruction turns the mode of
rv32emu from T2C back to the interpreter. This commit introduces a
"JIT-cache" table lookup to make it redirect to the T2C JIT-ed code
entry and avoid the mode change.

There are several scenarios benefitting from this approach, e.g.
function pointer invocation and far-way function call. The former
like "qsort" can be speeded up by two times, and the latter like
"fibonacci", which compiled from the hand-written assembly for
creating "JALR" instructions, can even reach x4.8 performance
enhencement.
@jserv jserv merged commit 34c3db0 into sysprog21:master Aug 7, 2024
8 checks passed
@jserv
Copy link
Contributor

jserv commented Aug 7, 2024

Thank @vacantron for contributing!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants