@jonathanpwang jonathanpwang commented Aug 20, 2025

Uses the nightly become keyword to make LLVM emit musttail calls. This did not work at first, when the Handler type returned Result, but it does work now that handlers return nothing.
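
For readers unfamiliar with this handler shape, here is a minimal sketch on stable Rust. The names Vm, Inst, op_add, op_halt and run are hypothetical and do not come from this PR. Each handler returns nothing and ends by calling the next handler. On nightly with the explicit_tail_calls feature, that final call would be written with become so the tail call is guaranteed; on stable it is an ordinary call that LLVM may or may not turn into a jump.

```rust
// Hypothetical miniature VM; none of these names come from OpenVM.
struct Vm {
    pc: usize,
    acc: u64,
}

// Handlers return nothing, matching the shape that worked with `become`.
type Handler = fn(&mut Vm, &[Inst]);

struct Inst {
    handler: Handler,
    operand: u64,
}

fn op_add(vm: &mut Vm, prog: &[Inst]) {
    vm.acc += prog[vm.pc].operand;
    vm.pc += 1;
    let next = prog[vm.pc].handler;
    // Nightly (#![feature(explicit_tail_calls)]): `become next(vm, prog)`.
    // Stable: a plain call in tail position, with no guarantee of elimination.
    next(vm, prog)
}

fn op_halt(_vm: &mut Vm, _prog: &[Inst]) {
    // Returning without calling another handler ends the chain.
}

// Assumes the program is non-empty and ends with a halting instruction.
fn run(prog: &[Inst]) -> u64 {
    let mut vm = Vm { pc: 0, acc: 0 };
    (prog[0].handler)(&mut vm, prog);
    vm.acc
}
```

Running a three-instruction program (add 2, add 3, halt) through run leaves 5 in the accumulator. Without guaranteed tail calls, the stack depth of this sketch grows with the number of executed instructions, which is the problem become is meant to remove.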

Ideas taken from https://github.com/xacrimon/tcvm/blob/main/src/interp.rs

To remove code duplication, I created local declarative dispatch! macros for the "enum dispatch" logic over function pointers that we were doing. The benefit is that each macro is now shared across four functions: {e1,e2} x {pre_compute,handler}. I did not make a single general-purpose macro because Rust's macro rules make this hard: a macro cannot reference idents that it does not declare, and the idents used vary from function to function.
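
A minimal sketch of what a local dispatch macro of this kind can look like, under the assumption that it maps enum variants to function pointers. Opcode, BinFn, add_impl, sub_impl and select_handler are hypothetical names used only for illustration; the real macros in this PR may take a different form. Note that every ident the macro body uses is passed in by the caller.

```rust
// Hypothetical opcode enum and handler type, for illustration only.
#[derive(Clone, Copy)]
enum Opcode {
    Add,
    Sub,
}

type BinFn = fn(u32, u32) -> u32;

fn add_impl(a: u32, b: u32) -> u32 {
    a.wrapping_add(b)
}

fn sub_impl(a: u32, b: u32) -> u32 {
    a.wrapping_sub(b)
}

fn select_handler(op: Opcode) -> BinFn {
    // Local macro: the variants and function names are supplied as arguments,
    // so the macro never refers to an ident it was not given.
    macro_rules! dispatch {
        ($op:expr, $($variant:ident => $func:ident),+ $(,)?) => {
            match $op {
                $(Opcode::$variant => $func as BinFn,)+
            }
        };
    }
    dispatch!(op, Add => add_impl, Sub => sub_impl)
}
```

Because the macro is defined inside the function, each call site can keep its own variant of it without polluting the module namespace.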

Removed pc_base from the lookup and instead left some empty space at the start of the table to avoid the pc - pc_base calculation. self.pc_base is a runtime variable, so we don't want to spend a register (or worse, a load/store) on accessing it.
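
The padding idea can be sketched as follows. This is an illustration only: build_padded_table and pc_index are hypothetical helpers, and the value 4 for DEFAULT_PC_STEP is an assumption made for the example.

```rust
// Assumed step size for this sketch; the real constant lives in the codebase.
const DEFAULT_PC_STEP: u32 = 4;

// Hypothetical helper: prepend `pc_base / DEFAULT_PC_STEP` filler entries so
// that the table can be indexed by `pc / DEFAULT_PC_STEP` directly.
fn build_padded_table<T: Clone>(pc_base: u32, entries: &[T], filler: T) -> Vec<T> {
    let pad = (pc_base / DEFAULT_PC_STEP) as usize;
    let mut table = vec![filler; pad];
    table.extend_from_slice(entries);
    table
}

// No subtraction of pc_base at lookup time.
fn pc_index(pc: u32) -> usize {
    (pc / DEFAULT_PC_STEP) as usize
}
```

With pc_base = 8 and two real entries, the table has four slots: the first two are filler that no valid pc reaches, and pc values 8 and 12 map to the two real entries. The cost is the memory for the unused leading slots.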

To complete this:

  • For each crate, add a "tco" feature, and then add the #[create_tco_handler] attribute above the execute_e1_impl functions. Then, for each Executor implementation, copy the pre_compute function implementation verbatim, but switch to the handler function signature and return the tco handler fn pointer instead.
  • Do the same for metered execution.
  • Switch to x86 global asm instead of relying on LLVM if we want to be extra safe. This seemed complicated and hard to do fully properly, so I prefer become.

Closes INT-4309

codspeed-hq bot commented Aug 20, 2025

CodSpeed WallTime Performance Report

Merging #2013 will improve performance by 95.21%

Comparing feat/tco (aee66b1) with main (4852493)

⚠️ Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

Summary

⚡ 18 improvements
✅ 12 untouched benchmarks

Benchmarks breakdown

| Benchmark | BASE | HEAD | Change |
|---|---|---|---|
| benchmark_execute[bubblesort] | 33.1 ms | 21.6 ms | +53.62% |
| benchmark_execute[factorial_iterative_u256] | 150.6 ms | 84.6 ms | +78.03% |
| benchmark_execute[fibonacci_iterative] | 48.8 ms | 26.5 ms | +84.06% |
| benchmark_execute[fibonacci_recursive] | 75.7 ms | 38.8 ms | +95.21% |
| benchmark_execute[keccak256] | 38.3 ms | 24.7 ms | +55.16% |
| benchmark_execute[keccak256_iter] | 126.4 ms | 83.7 ms | +51.09% |
| benchmark_execute[pairing] | 148.9 ms | 134.8 ms | +10.45% |
| benchmark_execute[quicksort] | 37.4 ms | 23.6 ms | +58.44% |
| benchmark_execute[revm_transfer] | 53.6 ms | 39.2 ms | +36.76% |
| benchmark_execute[sha256] | 37.7 ms | 23.2 ms | +62.73% |
| benchmark_execute[sha256_iter] | 109.6 ms | 60.1 ms | +82.4% |
| benchmark_execute_metered[bubblesort] | 67.7 ms | 44.4 ms | +52.41% |
| benchmark_execute_metered[factorial_iterative_u256] | 252.8 ms | 210.8 ms | +19.91% |
| benchmark_execute_metered[fibonacci_recursive] | 109.1 ms | 91.6 ms | +19.09% |
| benchmark_execute_metered[keccak256_iter] | 179.6 ms | 147.9 ms | +21.4% |
| benchmark_execute_metered[quicksort] | 71.4 ms | 50.1 ms | +42.53% |
| benchmark_execute_metered[revm_transfer] | 66.7 ms | 60.1 ms | +11.06% |
| benchmark_internal_verifier_execute[fibonacci] | 46.4 ms | 40.4 ms | +14.68% |

@jonathanpwang changed the title from "feat: tail call elimination" to "feat(nightly): tail call elimination" Aug 20, 2025
@jonathanpwang changed the title from "feat(nightly): tail call elimination" to "feat(nightly): execution becomes faster" Aug 21, 2025

@jonathanpwang force-pushed the feat/tco branch 3 times, most recently from 4a1e5e8 to e066de4 on August 22, 2025 03:55

return;
}
// exec_state.pc should have been updated by execute_impl at this point
let next_handler = interpreter.get_handler(exec_state.vm_state.pc);
Contributor

@shuklaayush commented Aug 22, 2025

both get_handler and get_pre_compute call get_pc_index. maybe we can calculate the index once and reuse it
or add a get_pre_handler_and_pre_compute function

Contributor Author

the compiler should optimize this out

Contributor

oh actually this is the new pc

Contributor

still think there's duplicate work happening here that won't be optimized out

Contributor Author

oh you mean like because new pc doesn't get passed across function boundary

Contributor

yep
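
The suggestion in this thread, computing the pc index once and reusing it, could look roughly like the sketch below. Table, Entry, double and get_handler_and_pre_compute are hypothetical names, and the value 4 for DEFAULT_PC_STEP is an assumption; the real get_handler and get_pre_compute signatures are not shown in this conversation.

```rust
// Assumed step size for this sketch.
const DEFAULT_PC_STEP: u32 = 4;

type Handler = fn(u64) -> u64;

// Hypothetical table entry holding both pieces that are looked up per pc.
struct Entry {
    handler: Handler,
    pre_compute: u64,
}

struct Table {
    entries: Vec<Entry>,
}

impl Table {
    fn get_pc_index(&self, pc: u32) -> usize {
        (pc / DEFAULT_PC_STEP) as usize
    }

    // Two separate lookups: each one recomputes the index.
    fn get_handler(&self, pc: u32) -> Option<Handler> {
        self.entries.get(self.get_pc_index(pc)).map(|e| e.handler)
    }

    fn get_pre_compute(&self, pc: u32) -> Option<u64> {
        self.entries.get(self.get_pc_index(pc)).map(|e| e.pre_compute)
    }

    // Combined lookup: one index computation and one bounds check.
    fn get_handler_and_pre_compute(&self, pc: u32) -> Option<(Handler, u64)> {
        let entry = self.entries.get(self.get_pc_index(pc))?;
        Some((entry.handler, entry.pre_compute))
    }
}

// Example handler used to exercise the table.
fn double(x: u64) -> u64 {
    x * 2
}
```

Whether the compiler merges the two separate lookups on its own depends on inlining and on whether the new pc is visible across the function boundary, which is the point debated above.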

}
// exec_state.pc should have been updated by execute_impl at this point
let next_handler = interpreter.get_handler(exec_state.vm_state.pc);
if next_handler.is_none() {
Contributor

not required rn but maybe we can just return a null pointer instead of Option

Contributor Author

again I hope the compiler just optimizes this out
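
As background for this exchange: Rust guarantees that an Option wrapping a function pointer has the same size as the bare function pointer, with None stored in the otherwise unused null value. The sketch below checks that; Handler and option_handler_is_pointer_sized are hypothetical names, and the real handler type in this PR takes different arguments.

```rust
use std::mem::size_of;

// Hypothetical handler type for illustration.
type Handler = fn(u32) -> u32;

// Option<fn> uses the null niche, so no extra discriminant is stored.
fn option_handler_is_pointer_sized() -> bool {
    size_of::<Option<Handler>>() == size_of::<Handler>()
}
```

So, at the representation level, an is_none() check on such an Option is a comparison against null.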

let handler_fn = quote! {
#[inline(never)]
unsafe fn #handler_name #handler_generics (
interpreter: &::openvm_circuit::arch::interpreter::InterpretedInstance<#f_type, #ctx_type>,
Contributor

not sure if it matters but maybe we just pass like a slimmed down instruction table that contains a function/mapping from pc -> (data, handler) instead of a reference to this whole struct

Contributor Author

yea will do with the register pinning

Contributor

@shuklaayush left a comment

looks good. there's an argument that we can have a single handler function instead of separate precompute and handler functions but i think the current approach with using macros to avoid duplication is fine too

Comment on lines -45 to +50
- /// `pc_index = (pc - pc_base) / DEFAULT_PC_STEP`.
+ /// `pc_index = pc / DEFAULT_PC_STEP`.
+ /// SAFETY: The first `pc_base / DEFAULT_PC_STEP` entries will be unreachable. We do this to
+ /// avoid needing to subtract `pc_base` during runtime.
Contributor

is pc_base guaranteed to be bounded?

Contributor Author

well it's bounded by the ELF size

}
#[cfg(feature = "tco")]
{
tracing::debug!("execute_tco");
Contributor

slightly pedantic, but i feel the tail-recursive function is more like a "trampoline"

Contributor Author

hmm I thought (but didn't validate) that other interpreters refer to trampoline as the standard one?
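
For reference on the terminology, the pattern usually called a trampoline has each step return the next step to a driver loop, rather than calling it directly. The sketch below illustrates that usage under hypothetical names (State, Next, StepFn, step, trampoline); it is not code from this PR.

```rust
// Hypothetical state for illustration.
struct State {
    n: u64,
    acc: u64,
}

type StepFn = fn(&mut State) -> Next;

// Newtype wrapper so a step can return "the next step" without a recursive type alias.
struct Next(Option<StepFn>);

fn step(s: &mut State) -> Next {
    if s.n == 0 {
        return Next(None); // Nothing left to run.
    }
    s.acc += s.n;
    s.n -= 1;
    Next(Some(step as StepFn))
}

// The driver loop "bounces" between steps, so the stack depth stays constant
// without relying on tail call elimination.
fn trampoline(s: &mut State) {
    let mut next = Next(Some(step as StepFn));
    while let Some(f) = next.0 {
        next = f(s);
    }
}
```

In the tail-call style, by contrast, each handler transfers control to the next one itself and there is no driver loop between instructions.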

Contributor

@nyunyunyunyu left a comment

Nothing else besides Ayush's comments. LGTM if the comments are addressed

| group | app.proof_time_ms | app.cycles | app.cells_used | leaf.proof_time_ms | leaf.cycles | leaf.cells_used |
|---|---|---|---|---|---|---|
| verify_fibair | 2,078 | 322,700 | 18,750,324 | - | - | - |
| fibonacci | (-36 [-1.5%]) 2,379 | 1,500,210 | 51,504,507 | - | - | - |
| regex | (-111 [-1.5%]) 7,525 | 4,108,597 | 164,734,992 | - | - | - |
| ecrecover | (-9 [-0.6%]) 1,377 | 140,487 | 8,866,654 | - | - | - |
| pairing | (-41 [-1.1%]) 3,855 | 1,882,939 | 98,834,293 | - | - | - |

Commit: aee66b1

Benchmark Workflow

@jonathanpwang jonathanpwang added this pull request to the merge queue Aug 22, 2025
Merged via the queue into main with commit aad0172 Aug 22, 2025
34 checks passed
@jonathanpwang jonathanpwang deleted the feat/tco branch August 22, 2025 23:28