Techniques for writing interpreter loops #3
In this irlo forum discussion stocklund explained how clang and LLVM are able to optimize a plain `switch`-based dispatch loop. It seems possible that rustc could be modified to do the same. I'm not sure about some of the reasoning in the post as to why this shouldn't be an issue anymore (namely that modern CPUs are good enough at branch prediction): my testing did show that the branch predictors of low-power CPUs are still a way behind, and that computed gotos continue to measurably outperform other methods there.
@pliniker I imagine this being less of an issue on newer architectures, but I wouldn't be surprised if the problem persists on mobile/embedded devices (e.g. those running ARM). Even if it's no longer a problem, I think it's worth documenting, including data to show how things behave.
My dispatch experiments did show that ARM processors do much better with computed gotos. I'm also curious to see how Spectre mitigation microcode updates will affect branch prediction. I had read suggestions that there would be a decrease in the performance of high-end Intel CPUs' branch predictors, but I have no data on that yet.
A small comment regarding the available techniques: if one uses Rust as a language backend, its "safety" is presumably one of the main features the developer had in mind; therefore any solution that "breaks" this safety, especially for such a core component of a language VM, practically nullifies the Rust advantage. As such, the only variants that make sense are those that leverage "native Rust". BTW: is there an example of how such a loop based on TCO would look in Rust? And what is the performance advantage over the plain `loop` + `match`? (Update: @pliniker's article https://pliniker.github.io/post/dispatchers/, also linked in the initial comment, provides such an example.)

Regarding the `match` approaches: without knowing the internals of how Rust compiles `match` down to assembler, I would assume that the first approach (i.e. one single `match`) […]. For example, in my interpreter I decided to go with the second approach due to readability. However, after doing that refactoring (i.e. transforming the 100 […]) […].

As related to writing interpreter loops, there are two additional topics: […]
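On the TCO question above: Rust doesn't guarantee tail-call optimisation, so one hedged approximation is a trampoline, where each opcode handler returns the next program counter (or `None` to halt) and a small driver follows the chain of indirect calls. This is only a sketch with made-up names, not code from any project mentioned in this thread:

```rust
// Trampoline-style dispatch: handlers return the next program counter
// instead of tail-calling each other, since Rust has no guaranteed TCO.
type Handler = fn(&mut Vm, usize) -> Option<usize>;

struct Vm {
    acc: i64,
    code: Vec<u8>,
}

fn op_inc(vm: &mut Vm, pc: usize) -> Option<usize> {
    vm.acc += 1;
    Some(pc + 1) // continue with the next instruction
}

fn op_halt(_vm: &mut Vm, _pc: usize) -> Option<usize> {
    None // stop the trampoline
}

// Hypothetical encoding: opcode 0 = inc, anything else = halt.
fn handler_for(opcode: u8) -> Handler {
    match opcode {
        0 => op_inc,
        _ => op_halt,
    }
}

fn run(vm: &mut Vm) {
    let mut pc = 0;
    loop {
        let op = vm.code[pc];
        match handler_for(op)(vm, pc) {
            Some(next) => pc = next,
            None => break,
        }
    }
}

fn main() {
    let mut vm = Vm { acc: 0, code: vec![0, 0, 0, 1] };
    run(&mut vm);
    println!("{}", vm.acc); // three `inc` instructions, then halt: prints 3
}
```

Note that the driver still contains a loop and an indirect call, so this mainly restructures the code rather than eliminating dispatch overhead.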
The interpreter loop may be a core component, but it's also quite self-contained and easily testable. IMO it'd be a prime example where a bit of unsafe code is justified.
I don't debate that the "loop" might be easily tested and formally checked; however, you are taking one thing for granted: that the input (i.e. the "bytecode") is "correct" (i.e. conforms to the "loop" specifications). Thus you are ignoring one major component which might not be as simple to test and check: the "compiler" and then the "optimizer" (or any other component that creates the "bytecode"). Think about this: I bet someone has formally proved that, say, a subset of a given real assembly language is "correct", and therefore one might expect that this property translates to the "machine code" being run. In practice this doesn't happen, as the excessive number of buffer over- and under-runs shows...
Maybe I overlook something, but AFAIK this is not really a problem. There is a lookup from bytecode to function pointer. If this dispatch is correct and the functions that are jumped to are correctly implemented, we're back to safe operations. The only "attack" that would be possible from the bytecode is sending an invalid opcode for which there is no function pointer in the table. But that is not hard or expensive to verify beforehand. It would also be possible (if we use only a byte range for the opcode) to just fill up the table with halt commands. I've downloaded @pliniker's dispatch crate and played with it a little bit. The simple match loop variant is... special. With the right amount of useless match arms it really performs quite well. But I only have a Haswell laptop to test on.
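The halt-padded table idea above can be sketched as follows. All names here are hypothetical: a 256-entry dispatch table indexed directly by the opcode byte, with every undefined slot pointing at a `halt` handler, so the hot loop needs no validity check:

```rust
// Function-pointer dispatch: each handler returns `false` to stop the VM.
type OpFn = fn(&mut State) -> bool;

struct State {
    acc: i64,
}

fn op_add_one(state: &mut State) -> bool {
    state.acc += 1;
    true
}

fn op_halt(_state: &mut State) -> bool {
    false
}

fn build_table() -> [OpFn; 256] {
    // Fill every slot with `halt`, then install the real handlers.
    let mut table: [OpFn; 256] = [op_halt as OpFn; 256];
    table[0] = op_add_one;
    table
}

fn main() {
    let table = build_table();
    let bytecode = [0u8, 0, 0xFF]; // two increments, then an undefined opcode
    let mut state = State { acc: 0 };
    for &byte in &bytecode {
        // A u8 index can never go out of bounds in a 256-entry table.
        if !table[byte as usize](&mut state) {
            break;
        }
    }
    println!("{}", state.acc); // prints 2
}
```

Because the table covers the whole `u8` range, invalid bytecode degrades into an early halt instead of undefined behaviour, which matches the "fill up the table with halt commands" suggestion.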
If it helps: Inko's VM loop uses something along these lines:

```rust
let instructions = X;
let mut index = 0;

loop {
    let instruction = instructions[index];

    match instruction.instruction_type {
        ...
    }

    index += 1;
}
```

In case of Inko I went with the assumption that the bytecode is valid by the time we run it, and there's some basic validation going on when parsing bytecode (enough that we at least can't end up with out-of-bounds array access). I have been meaning to add a more in-depth bytecode verifier at some point, but its priority is super low. A slightly safer alternative would be:

```rust
let instructions = X;
let mut index = 0;

while index < instructions.len() {
    let instruction = instructions[index];

    match instruction.instruction_type {
        ...
    }

    index += 1;
}
```

However, I found that in Inko this added some unnecessary overhead, so I went with the first snippet. When documenting interpreter techniques I think it's better if we err on the side of safety / less performance, but clearly state how one might be able to further optimise the code (possibly at the cost of safety). This way, if somebody blindly copy-pastes the code they don't expose themselves to all kinds of nasty issues.
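A possible middle ground between the two snippets above (this is a hedged sketch with invented instruction names, not Inko's actual code) is to fold the bounds check into the loop condition with `slice::get`, so there is exactly one checked access per iteration and the loop still cannot read out of bounds:

```rust
#[derive(Copy, Clone)]
enum InstructionType {
    IncrementCounter,
    Halt,
}

struct Instruction {
    instruction_type: InstructionType,
}

fn run(instructions: &[Instruction]) -> u64 {
    let mut counter = 0;
    let mut index = 0;
    // `get` returns None past the end, so loop control and the
    // bounds check are the same branch.
    while let Some(instruction) = instructions.get(index) {
        match instruction.instruction_type {
            InstructionType::IncrementCounter => counter += 1,
            InstructionType::Halt => break,
        }
        index += 1;
    }
    counter
}

fn main() {
    let program = vec![
        Instruction { instruction_type: InstructionType::IncrementCounter },
        Instruction { instruction_type: InstructionType::IncrementCounter },
        Instruction { instruction_type: InstructionType::Halt },
    ];
    println!("{}", run(&program)); // prints 2
}
```

Whether this is faster than the `while index < instructions.len()` form depends on what the optimiser does with each; it would need benchmarking on the target VM.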
I have a question regarding JIT compilation: my (probably wrong and/or simplistic) understanding is that one builds bits of assembly associated with each opcode and then unrolls the interpreter loop by copying these bits of assembly for each opcode into a buffer. If that is the case, would it be possible to unify that with the "loop + inline assembly" approach? If we have to hardcode each operation in asm anyway, why not reuse it for the interpreter loop?
For interpreters one of the core components is the interpreter/dispatcher loop. In its most basic form this is essentially a `loop` combined with a (giant) `match`, but depending on the language there are other techniques available. I think it's worth collecting the various approaches available to Rust, their benefits, drawbacks, etc. Over time we should probably extend this into a document outlining this more clearly, including examples and whatnot. @pliniker wrote a bit about this in https://pliniker.github.io/post/dispatchers/, and I discussed things a bit in https://www.reddit.com/r/rust/comments/66h3t2/alternatives_to_dispatching_using_loop_match/.
Available Techniques

- `loop` + `match`: […] as long as the `match` isn't 5 000 lines long
- `loop` plus some form of inline assembly: […] `goto` […]
- Tail calls: […] (a `break` in a function won't work). This means that to control the loop you'd still need some kind of `match`.
There are probably more, but these are the ones I can think of at the moment.
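For reference, the first technique (`loop` + `match`) can be sketched as a minimal stack machine; the opcode set here is made up purely for illustration:

```rust
// Minimal `loop` + `match` interpreter for a tiny stack machine.
#[derive(Copy, Clone)]
enum Op {
    Push(i64),
    Add,
    Halt,
}

fn interpret(code: &[Op]) -> Vec<i64> {
    let mut stack = Vec::new();
    let mut pc = 0;
    loop {
        match code[pc] {
            Op::Push(value) => stack.push(value),
            Op::Add => {
                // Pop two operands and push their sum.
                let b = stack.pop().expect("stack underflow");
                let a = stack.pop().expect("stack underflow");
                stack.push(a + b);
            }
            Op::Halt => break,
        }
        pc += 1;
    }
    stack
}

fn main() {
    let code = [Op::Push(2), Op::Push(3), Op::Add, Op::Halt];
    println!("{:?}", interpret(&code)); // prints [5]
}
```

The giant-`match` versions discussed in this thread are the same shape, just with many more arms and a real instruction encoding instead of an enum of three opcodes.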