Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

perf: More efficient touch_range #1322

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open

perf: More efficient touch_range #1322

wants to merge 1 commit into from

Conversation

zlangley
Copy link
Contributor

@zlangley zlangley commented Jan 29, 2025

Previously touch_range was implemented by calling touch_address on each address in pointer..pointer + len. But for persistent memory, in both MemoryMerkleChip and PersistentBoundaryChip, we actually care about touched aligned blocks of size 8. So most of the triggered calls to touch_address via a single touch_range were redundant, repeatedly querying the same hashmap at the same block index.

This comment has been minimized.

@jonathanpwang
Copy link
Contributor

can you add explanation of what the optimization is, and summarize what the perf gain is for microbenchmarks as well as reth benchmark

@jonathanpwang
Copy link
Contributor

seems to have no perf diff

let first_address_label = address / CHUNK as u32;
let last_address_label = (address + len - 1) / CHUNK as u32;
for address_label in first_address_label..=last_address_label {
self.touch_node(0, as_label, address_label);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is still internally calling the same function, I'm guessing the compiler just optimizes it out?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How would the compiler optimize it out? Previously this function would have been called N times, now it is called N/8 times.

@jonathanpwang
Copy link
Contributor

Closing for now because it doesn't seem to have a real perf impact, and introduces more code

@zlangley
Copy link
Contributor Author

@jonathanpwang I have a hard time believing this has no real perf impact. Local microbenchmark shows that MemoryInterface::touch_range is 17% of MemoryController::finalize before this change and 2.7% afterwards.

@jonathanpwang jonathanpwang reopened this Jan 29, 2025

This comment has been minimized.

@zlangley
Copy link
Contributor Author

still not seeing a lot of perf diff: https://github.com/axiom-crypto/openvm-reth-benchmark/blob/gh-pages/benchmarks-dispatch/refs/heads/main/reth-e9fe5226fd30a7646b0e44d3c1d0681ea961dcd5deeb55e4fd95bf88dd4dfd0f.md

I only expect reth.prove_e2e.block_21000000 trace_gen_time_ms to improve

@Golovanov399
Copy link
Contributor

Do we ever call touch_address after this change? If no, then it makes sense to remove this function as you did with all_addresses()

Copy link

group app.proof_time_ms app.cycles app.cells_used leaf.proof_time_ms leaf.cycles leaf.cells_used
verify_fibair (-17 [-0.8%]) 2,156 513,786 18,710,395 - - -
fibonacci_program (-295 [-5.6%]) 4,978 1,500,095 51,485,080 - - -
regex_program (-395 [-2.6%]) 14,884 1,914,103 165,455,373 - - -
ecrecover_program (+17 [+0.7%]) 2,522 284,567 15,055,723 - - -

Commit: d49e76a

Benchmark Workflow

@zlangley
Copy link
Contributor Author

zlangley commented Jan 30, 2025

On my machine, trace gen for reth benchmark for block 21000000 seems to decrease by 45s (~9%). Not sure why this is not reflected on reth benchmark (not sure how memory allocator/machine architecture is really relevant here; this PR should just be strictly decreasing the number of calls to HashMap::insert).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants