core/vm: speed up push and interpreter loop #30662

holiman · 2024-10-23T08:18:13Z

Looking at the cpu profile of a burntpix benchmark, I noticed that a lot of time was spent in gas-used, in the interpreter loop. It's an actual call (not inlined), which explicitly wants to be ignored by tracing ("tracing.GasChangeIgnored"), so it can be safely and simply inlined.

The other change is in pushX. These also do a call to common.RightPadBytes. I replaced that by a doing a corresponding Lsh on the u256 if needed. Note: it's needed only to make the stack output look right, for fuzzers. It technically doesn't matter what we put there: if code ends on a pushdata immediate, nothing will consume the stack element. We could just as well just ignore it, if we didn't care about fuzzers (which I do).

Seems quite a lot faster on burntpix, according to my runs.

This PR:

EVM gas used:    5642735088
execution time:  34.84609475s
allocations:     915683
allocated bytes: 175334088

EVM gas used:    5642735088
execution time:  36.671958278s
allocations:     915701
allocated bytes: 175340528

Master

EVM gas used:    5642735088
execution time:  49.349209526s
allocations:     915684
allocated bytes: 175333368

EVM gas used:    5642735088
execution time:  46.581006598s
allocations:     915681
allocated bytes: 175330728

holiman · 2024-10-23T08:38:11Z

Btw if we want this, we can also remove tracing.GasChangeIgnored. cc @s1na ?

holiman · 2024-10-23T11:43:49Z

Added a benchmark

goos: linux
goarch: amd64
pkg: github.com/ethereum/go-ethereum/core/vm/runtime
cpu: 12th Gen Intel(R) Core(TM) i7-1270P
                        │ loop.master  │            loop.master.2             │
                        │    sec/op    │    sec/op     vs base                │
SimpleLoop/loop2-100M-8   846.1m ± 15%   645.2m ± 33%  -23.74% (p=0.020 n=20)

                        │  loop.master  │          loop.master.2          │
                        │     B/op      │     B/op       vs base          │
SimpleLoop/loop2-100M-8   6.365Ki ± 13%   6.570Ki ± 17%  ~ (p=0.640 n=20)

                        │ loop.master │        loop.master.2         │
                        │  allocs/op  │ allocs/op   vs base          │
SimpleLoop/loop2-100M-8    46.00 ± 4%   46.00 ± 4%  ~ (p=0.576 n=20)

jwasinger · 2024-10-23T14:59:07Z

core/vm/instructions.go


+		// Missing bytes: pushByteSize - len(pushData)
+		if missing := pushByteSize - (end - start); missing > 0 {


Even if we could remove this if from the hot-loop and adjust the tracers correspondingly, it's (understandably) a micro optimization that doesn't bring measurable performance benefits:

│ bench_if_removed.txt │ bench_if_not_removed.txt │ │ sec/op │ sec/op vs base │ SimpleLoop/staticcall-identity-100M-12 162.9m ± 0% 160.7m ± 2% -1.35% (p=0.009 n=10) SimpleLoop/call-identity-100M-12 183.9m ± 2% 180.8m ± 1% -1.64% (p=0.001 n=10) SimpleLoop/loop-100M-12 182.0m ± 1% 182.0m ± 0% ~ (p=0.853 n=10) SimpleLoop/loop2-100M-12 200.7m ± 0% 208.0m ± 5% +3.68% (p=0.000 n=10) SimpleLoop/call-nonexist-100M-12 496.8m ± 2% 508.8m ± 9% +2.41% (p=0.011 n=10) SimpleLoop/call-EOA-100M-12 142.3m ± 0% 141.1m ± 5% ~ (p=0.143 n=10) SimpleLoop/call-reverting-100M-12 323.7m ± 0% 324.9m ± 1% ~ (p=0.123 n=10) geomean 220.1m 220.9m +0.37%

core/vm/interpreter.go

Co-authored-by: Sina M <1591639+s1na@users.noreply.github.com>

Looking at the cpu profile of a burntpix benchmark, I noticed that a lot of time was spent in gas-used, in the interpreter loop. It's an actual call (not inlined), which explicitly wants to be ignored by tracing ("tracing.GasChangeIgnored"), so it can be safely and simply inlined. The other change is in `pushX`. These also do a call to `common.RightPadBytes`. I replaced that by a doing a corresponding `Lsh` on the `u256` if needed. Note: it's needed only to make the stack output look right, for fuzzers. It technically doesn't matter what we put there: if code ends on a pushdata immediate, nothing will consume the stack element. We could just as well just ignore it, if we didn't care about fuzzers (which I do). Seems quite a lot faster on burntpix, according to my runs. This PR: ``` EVM gas used: 5642735088 execution time: 34.84609475s allocations: 915683 allocated bytes: 175334088 ``` ``` EVM gas used: 5642735088 execution time: 36.671958278s allocations: 915701 allocated bytes: 175340528 ``` Master ``` EVM gas used: 5642735088 execution time: 49.349209526s allocations: 915684 allocated bytes: 175333368 ``` ``` EVM gas used: 5642735088 execution time: 46.581006598s allocations: 915681 allocated bytes: 175330728 ``` --------- Co-authored-by: Sina M <1591639+s1na@users.noreply.github.com> Co-authored-by: Felix Lange <fjl@twurst.com>

holiman requested review from karalabe and rjl493456442 as code owners October 23, 2024 08:18

holiman added 2 commits October 23, 2024 13:42

core/vm/runtime: benchtest with pushing/popping looping code

287aa9c

core/vm: speed up push and interpreter loop

8c582ab

holiman force-pushed the speedy_interpreter_loop branch from 63d2422 to 8c582ab Compare October 23, 2024 11:42

jwasinger reviewed Oct 23, 2024

View reviewed changes

s1na reviewed Oct 24, 2024

View reviewed changes

core/vm/interpreter.go Outdated Show resolved Hide resolved

s1na reviewed Oct 24, 2024

View reviewed changes

core/vm/interpreter.go Outdated Show resolved Hide resolved

holiman and others added 3 commits October 24, 2024 09:03

core/vm: doc-suggestions from code review

0572bcd

Co-authored-by: Sina M <1591639+s1na@users.noreply.github.com>

core/vm: lint fixes

ab68830

core/vm: gofmt

bf83e7c

fjl added this to the 1.14.12 milestone Oct 30, 2024

fjl merged commit 25bc077 into ethereum:master Oct 30, 2024
3 checks passed

BrewTestBot mentioned this pull request Nov 19, 2024

ethereum 1.14.12 Homebrew/homebrew-core#198233

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

core/vm: speed up push and interpreter loop #30662

core/vm: speed up push and interpreter loop #30662

holiman commented Oct 23, 2024

holiman commented Oct 23, 2024

holiman commented Oct 23, 2024

jwasinger Oct 23, 2024 •

edited

Loading


		// Missing bytes: pushByteSize - len(pushData)
		if missing := pushByteSize - (end - start); missing > 0 {

core/vm: speed up push and interpreter loop #30662

core/vm: speed up push and interpreter loop #30662

Conversation

holiman commented Oct 23, 2024

holiman commented Oct 23, 2024

holiman commented Oct 23, 2024

jwasinger Oct 23, 2024 • edited Loading

Choose a reason for hiding this comment

jwasinger Oct 23, 2024 •

edited

Loading