-
Notifications
You must be signed in to change notification settings - Fork 570
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reduce overhead of indirect branch on AArch64 #2390
Labels
Comments
fhahn
added a commit
that referenced
this issue
Jul 2, 2018
movk reg, #0 is unnecessary, but currently the code expects a fixed size sequence at a few places, so we insert NOPs instead. Issue #2390 Change-Id: I7829044cd91cacdb42e4e9c74dc25d59bcc0619f
fhahn
added a commit
that referenced
this issue
Jul 2, 2018
Instead of creating unnecessary movk instructions when creating exit stubs, we insert nops instead just before the branch. Ideally we would just have a variable size for the exit stub fragments, but at different places DynamoRIO expects them the be of a fixed size and the last instruction to be a branch. Issue #2390 Change-Id: If798af4158d591dd3a7d522ee0118daadc5a1c89
fhahn
added a commit
that referenced
this issue
Jul 3, 2018
'movk reg, #0' is unnecessary, but currently the code expects a fixed size sequence at a few places, so we insert NOPs instead. Issue #2390
The unnecessary movk instructions are not emitted anymore. I think going forward it would be good to have some (automatic) performance tracking, to make sure we get the desired benefit. |
This was referenced Jul 18, 2020
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
It is well known that indirect branches are a major factor in the performance of dynamic binary translation, optimisation and instrumentation (see, for example, "Optimizing Indirect Branches in Dynamic Binary Translators", 2016), although in the case of dynamic binary instrumentation the overhead of instrumentation may dominate everything else.
DynamoRIO's handling of indirect branches on AArch64 is currently rather inefficient; we have so far not paid much attention to performance. This issue documents the current implementation of indirect branches and lists a few improvements that may be worth implementing. Some of these things may also be applicable to other architectures.
The sequence of instructions executed under DynamoRIO to simulate a single RET instruction in the original application typically looks like this:
Things to do:
Probably the biggest component of the overhead comes from the linear probing in the hash table: the value we want is in the hash table, but we use a linear search to find it and in this case it is the fourth value that we examine. On a low-performance CPU the loads may be expensive, and on a high-performance CPU the conditional branch at 0x4c52141c is likely to be unpredictable unless the branch predictor is very clever. The values we skip over came from C library start-up code: probably we will never use them again. A bigger hash table would help, but it would have to be much bigger. Other things that might help: flushing old values from the table, or flushing the whole table; or arranging for values added most recently to be found first. But those would be global changes to how the hash tables work.
Since AArch64 instructions are aligned, ibl_hash_func_offset should probably equal 2, and ibl_indcall_hash_offset at least 2.
The performance could perhaps be significantly improved just by adjusting some option defaults (in core/optionsx.h).
Are we using different hash tables for BR, BLR, and RET? Should we?
The AArch64 hash table look-up code could be tested and measured independently of DynamoRIO and easily replaced without changing anything else so it looks like "low-hanging fruit". Unrolling the loop in the look-up code might help on some hardware but would probably be harmful on other hardware. However, using a different conditional branch for the first match check than for subsequent match checks might be generally helpful as the probability of a match could be significantly different in the two cases. The app and translated addresses could be loaded together with LDP. One could avoid the direct branch in the loop, but ideally the first value we look at would be the right one so we would not loop at all so perhaps that is the path to be optimised primarily. We could avoid one (predictable) conditional branch (for cases we care about) by checking for a match before we check for zero.
There are unnecessary loads, stores and branches (including an unnecessarily indirect one) before and after the hash table look-up. Since the argument to an indirect branch is most often X30 perhaps that register should be used as the argument to the hash-table look-up code. Having X30 as the register to be restored in the fragment prefix might help, too, in general, as one could then branch to the fragment with BLR instead of BR. There is a global register allocation problem to be solved here.
It would be nice to translate BR to BR and RET to RET so as to help the CPU's branch predictor. This is feasible but it would be a big change to DynamoRIO: you would also have to translate BL to BL, which means putting the correct value into X30 after the BL.
Having all the app's indirect branches go though a single indirect branch (in this case at 0x4c521434) makes that branch very difficult to predict. Rather than jump to code that does the look-up and the branch, generally it is better to call a function that does the look-up and do the indirect branch after returning, so there is a separate BR/RET in the translation cache for each BR/RET in the original code. Could this be implemented on AArch64 without changing the way DynamoRIO works on other architectures?
There are other standard optimisations described in the literature but they are mostly architecture-independent and would require additional complexity: adding alternative methods for handling an indirect branch rather than just improving the current universal method (a shared hash table).
Here is what the sequence of instructions to simulate a RET using a hash table could perhaps look like in the best case (untested):
The text was updated successfully, but these errors were encountered: