feat(nightly): execution becomes faster #2013
Conversation
The `become` keyword was causing some corruption and not passing references properly between calls.
CodSpeed WallTime Performance Report: merging #2013 will improve performance by 95.21%.

| Benchmark | BASE | HEAD | Change |
|---|---|---|---|
| benchmark_execute[bubblesort] | 33.1 ms | 21.6 ms | +53.62% |
| benchmark_execute[factorial_iterative_u256] | 150.6 ms | 84.6 ms | +78.03% |
| benchmark_execute[fibonacci_iterative] | 48.8 ms | 26.5 ms | +84.06% |
| benchmark_execute[fibonacci_recursive] | 75.7 ms | 38.8 ms | +95.21% |
| benchmark_execute[keccak256] | 38.3 ms | 24.7 ms | +55.16% |
| benchmark_execute[keccak256_iter] | 126.4 ms | 83.7 ms | +51.09% |
| benchmark_execute[pairing] | 148.9 ms | 134.8 ms | +10.45% |
| benchmark_execute[quicksort] | 37.4 ms | 23.6 ms | +58.44% |
| benchmark_execute[revm_transfer] | 53.6 ms | 39.2 ms | +36.76% |
| benchmark_execute[sha256] | 37.7 ms | 23.2 ms | +62.73% |
| benchmark_execute[sha256_iter] | 109.6 ms | 60.1 ms | +82.4% |
| benchmark_execute_metered[bubblesort] | 67.7 ms | 44.4 ms | +52.41% |
| benchmark_execute_metered[factorial_iterative_u256] | 252.8 ms | 210.8 ms | +19.91% |
| benchmark_execute_metered[fibonacci_recursive] | 109.1 ms | 91.6 ms | +19.09% |
| benchmark_execute_metered[keccak256_iter] | 179.6 ms | 147.9 ms | +21.4% |
| benchmark_execute_metered[quicksort] | 71.4 ms | 50.1 ms | +42.53% |
| benchmark_execute_metered[revm_transfer] | 66.7 ms | 60.1 ms | +11.06% |
| benchmark_internal_verifier_execute[fibonacci] | 46.4 ms | 40.4 ms | +14.68% |
```rust
    return;
}
// exec_state.pc should have been updated by execute_impl at this point
let next_handler = interpreter.get_handler(exec_state.vm_state.pc);
```
both `get_handler` and `get_pre_compute` call `get_pc_index`. maybe we can calculate the index once and reuse it, or add a `get_pre_handler_and_pre_compute` function
the compiler should optimize this out
oh actually this is the new pc
still think there's duplicate work happening here that won't be optimized out
oh you mean like because new pc doesn't get passed across function boundary
yep
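The duplication being discussed can be folded into one lookup. A minimal sketch of the idea, with illustrative names and types (this is not the actual openvm-circuit API): compute the pc-derived index once and return both table entries from it, so the index arithmetic cannot be repeated across a function boundary.

```rust
// Illustrative sketch: one index computation feeds both lookups, instead of
// `get_handler` and `get_pre_compute` each calling `get_pc_index` internally.
const DEFAULT_PC_STEP: u32 = 4;

type Handler = fn(u64) -> u64;

struct Interpreter {
    handlers: Vec<Option<Handler>>,
    pre_computes: Vec<u64>,
}

impl Interpreter {
    #[inline(always)]
    fn get_pc_index(&self, pc: u32) -> usize {
        (pc / DEFAULT_PC_STEP) as usize
    }

    /// Single lookup: the index is computed once and reused for both tables.
    fn get_handler_and_pre_compute(&self, pc: u32) -> (Option<Handler>, u64) {
        let idx = self.get_pc_index(pc);
        (self.handlers.get(idx).copied().flatten(), self.pre_computes[idx])
    }
}

fn main() {
    let interp = Interpreter {
        handlers: vec![Some((|x: u64| x + 1) as Handler), None],
        pre_computes: vec![10, 20],
    };
    let (h, pre) = interp.get_handler_and_pre_compute(0);
    assert_eq!((h.unwrap())(41), 42);
    assert_eq!(pre, 10);
}
```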
```rust
}
// exec_state.pc should have been updated by execute_impl at this point
let next_handler = interpreter.get_handler(exec_state.vm_state.pc);
if next_handler.is_none() {
```
not required rn but maybe we can just return a null pointer instead of `Option`
again I hope the compiler just optimizes this out
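The compiler does in fact handle this one: for function pointers (which are non-nullable), Rust's niche optimization makes `Option<fn …>` exactly pointer-sized, with `None` stored as the null bit pattern, so returning `Option` costs nothing over a raw nullable pointer. A quick check:

```rust
use std::mem::size_of;

type Handler = fn(u64) -> u64;

fn main() {
    // Niche optimization: `Option<Handler>` occupies the same space as the
    // bare function pointer; `None` uses the null representation.
    assert_eq!(size_of::<Option<Handler>>(), size_of::<Handler>());
    println!("Option<fn> is pointer-sized");
}
```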
```rust
let handler_fn = quote! {
    #[inline(never)]
    unsafe fn #handler_name #handler_generics (
        interpreter: &::openvm_circuit::arch::interpreter::InterpretedInstance<#f_type, #ctx_type>,
```
not sure if it matters but maybe we just pass like a slimmed down instruction table that contains a function/mapping from pc -> (data, handler) instead of a reference to this whole struct
yea will do with the register pinning
looks good. there's an argument that we can have a single handler function instead of separate `precompute` and `handler` functions but i think the current approach with using macros to avoid duplication is fine too
```diff
-/// `pc_index = (pc - pc_base) / DEFAULT_PC_STEP`.
+/// `pc_index = pc / DEFAULT_PC_STEP`.
+/// SAFETY: The first `pc_base / DEFAULT_PC_STEP` entries will be unreachable. We do this to
+/// avoid needing to subtract `pc_base` during runtime.
```
is `pc_base` guaranteed to be bounded?
well it's bounded by the ELF size
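The padding trick under discussion can be sketched in a few lines (illustrative names, not the actual openvm-circuit code): instead of storing `pc_base` and computing `(pc - pc_base) / DEFAULT_PC_STEP` at runtime, the table is front-padded with `pc_base / DEFAULT_PC_STEP` unreachable entries, so the lookup is just `pc / DEFAULT_PC_STEP` with no subtraction and no load of a runtime variable.

```rust
const DEFAULT_PC_STEP: u32 = 4;

// Build a handler table whose first `pc_base / DEFAULT_PC_STEP` entries are
// never reached; valid pcs satisfy pc >= pc_base (pc_base is bounded by the
// ELF size, so the wasted prefix is bounded too).
fn build_padded_table(pc_base: u32, handlers: Vec<u32>) -> Vec<Option<u32>> {
    let pad = (pc_base / DEFAULT_PC_STEP) as usize;
    let mut table = vec![None; pad];
    table.extend(handlers.into_iter().map(Some));
    table
}

// The hot-path lookup: no `pc_base` subtraction needed.
fn lookup(table: &[Option<u32>], pc: u32) -> Option<u32> {
    table[(pc / DEFAULT_PC_STEP) as usize]
}

fn main() {
    // pc_base = 0x20 => 8 padding entries; pc 0x20 maps to the first handler.
    let table = build_padded_table(0x20, vec![100, 200, 300]);
    assert_eq!(lookup(&table, 0x20), Some(100));
    assert_eq!(lookup(&table, 0x28), Some(300));
}
```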
```rust
}
#[cfg(feature = "tco")]
{
    tracing::debug!("execute_tco");
}
```
slightly pedantic, but i feel the tail-recursive function is more like a "trampoline"
hmm I thought (but didn't validate) that other interpreters refer to trampoline as the standard one?
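For contrast, here is what a trampoline-style dispatcher looks like on stable Rust (illustrative types, not the PR's code): each handler returns the next handler instead of calling it, and a small driver loop "bounces" between them at constant stack depth. With nightly `become`, each handler would tail-call the next directly and the loop disappears.

```rust
struct State {
    pc: u32,
    acc: u64,
}

// A newtype wrapper is needed because Rust type aliases cannot be directly
// self-referential (fn(&mut State) -> Option<Handler> mentions Handler).
struct Handler(fn(&mut State) -> Option<Handler>);

fn add_one(state: &mut State) -> Option<Handler> {
    state.acc += 1;
    state.pc += 4;
    if state.pc < 12 { Some(Handler(add_one)) } else { Some(Handler(halt)) }
}

fn halt(_state: &mut State) -> Option<Handler> {
    None // no next handler: execution is done
}

// The trampoline: stack depth stays constant no matter how many handlers
// execute, because each handler returns before the next one begins.
fn run(mut h: Handler, state: &mut State) {
    while let Some(next) = (h.0)(state) {
        h = next;
    }
}

fn main() {
    let mut state = State { pc: 0, acc: 0 };
    run(Handler(add_one), &mut state);
    assert_eq!(state.acc, 3);
}
```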
```rust
    return;
}
// exec_state.pc should have been updated by execute_impl at this point
let next_handler = interpreter.get_handler(exec_state.vm_state.pc);
```
still think there's duplicate work happening here that won't be optimized out
Nothing else besides Ayush's comments. LGTM if the comments are addressed
Commit: aee66b1
Uses the nightly `become` keyword to tell LLVM to `musttail`. This did not work at first when the `Handler` type returned `Result`, but after I changed it to return nothing it does. Ideas taken from https://github.com/xacrimon/tcvm/blob/main/src/interp.rs

To remove code duplication, I created local declarative macros `dispatch!` for any "enum dispatch" logic of function pointers we were doing. The benefit is that this is now shared across four functions: {e1,e2} x {pre_compute,handler}. I did not make a single general-purpose macro because Rust token parsing rules make this hard: they don't let you reference idents you don't declare, and the idents used vary based on the function.

Removed `pc_base` and just leave some empty space to avoid the `pc - pc_base` calculation. This is because `self.pc_base` is a runtime variable, so we don't want to have to use a register (or worse, load/store) to access it.

To complete this: add the `#[create_tco_handler]` attribute above `execute_e1_impl` functions. Then for each `Executor` implementation, copy the `pre_compute` function implementation verbatim but switch to the `handler` function signature and return the tco handler fn pointer instead.

Switch to x86 global asm instead of relying on LLVM if we want to be extra safe: this seemed complicated and hard to do fully properly, so I prefer `become`.

Closes INT-4309
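The `dispatch!` macro itself isn't shown in this excerpt. As a rough illustration of the "local declarative macro for enum dispatch of function pointers" idea (all names here are hypothetical, not the actual openvm code), a single `macro_rules!` arm can expand the match over executor variants so several functions share one dispatch shape:

```rust
enum Executor {
    Add,
    Mul,
}

// One macro arm expands the enum-dispatch match; every function that needs
// this dispatch reuses the macro instead of duplicating the match.
macro_rules! dispatch {
    ($executor:expr, $lhs:expr, $rhs:expr) => {
        match $executor {
            Executor::Add => add_impl($lhs, $rhs),
            Executor::Mul => mul_impl($lhs, $rhs),
        }
    };
}

fn add_impl(a: u64, b: u64) -> u64 { a + b }
fn mul_impl(a: u64, b: u64) -> u64 { a * b }

// Two functions sharing the dispatch logic, analogous to the
// {e1,e2} x {pre_compute,handler} sharing described above.
fn execute_e1(executor: Executor, a: u64, b: u64) -> u64 {
    dispatch!(executor, a, b)
}

fn execute_e2(executor: Executor, a: u64, b: u64) -> u64 {
    dispatch!(executor, a, b).wrapping_add(1)
}

fn main() {
    assert_eq!(execute_e1(Executor::Add, 2, 3), 5);
    assert_eq!(execute_e2(Executor::Mul, 2, 3), 7);
}
```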