
fix(push): right pad if push parameter + PC +1 exceeds code length #7834

Merged: 7 commits into hyperledger:main on Nov 4, 2024

Conversation

Contributor

@letypequividelespoubelles commented Oct 30, 2024

PR description

The EYP states that the "push value" produced by a PUSHX instruction must be constructed as follows: grab X bytes starting at offset pc + 1 in the bytecode and, if you run out of bytecode, replace the missing bytes with 00's.
This PR fixes the missing right padding with 00's in the (rare) case where you run out of bytecode.

Recall that we may need to right pad the push value at most once per transaction: when we need to right pad, it means there is no more bytecode available, which means the next instruction will be a STOP, and thus the end of the execution.

Given that the gas cost of a PUSHX is constant and that the following opcode is a STOP, the value pushed onto the stack has no impact on either the EVM execution or the committed state changes.

However, it profoundly messes with the arithmetization of PUSH instructions in Linea (our Layer 2 zk-rollup), as it completely deviates from the EYP (and from the execution-specs, as well as Geth).
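
For illustration, the intended semantics boil down to something like the following (a minimal sketch of the Yellow Paper rule, not Besu's actual code; names are illustrative):

// Build the value pushed by PUSHX: take X bytes starting at pc + 1,
// zero-filling any bytes that fall past the end of the code.
static byte[] pushValue(final byte[] code, final int pc, final int x) {
  final byte[] value = new byte[x]; // zero-initialised, so the right padding comes for free
  final int copyStart = pc + 1;
  if (copyStart < code.length) {
    System.arraycopy(code, copyStart, value, 0, Math.min(x, code.length - copyStart));
  }
  return value;
}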

FYI @OlivierBBB

Fixed Issue(s)

Thanks for sending a pull request! Have you done the following?

  • Checked out our contribution guidelines?
  • Considered documentation and added the doc-change-required label to this PR if updates are required.
  • Considered the changelog and included an update if required.
  • For database changes (e.g. KeyValueSegmentIdentifier) considered compatibility and performed forwards and backwards compatibility tests

Locally, you can run these tests to catch failures early:

  • unit tests: ./gradlew build
  • acceptance tests: ./gradlew acceptanceTest
  • integration tests: ./gradlew integrationTest
  • reference tests: ./gradlew ethereum:referenceTests:referenceTests

Contributor

@garyschulte left a comment

It would be nice to have a unit test that asserts this behavior, since it has technically been out of spec for some time.

// Right Pad the push with 0s up to pushSize if greater than the copyLength
final int rightPad = pushSize - copyLength;
if (rightPad != 0) {
  push = push.shiftLeft(rightPad * 8, MutableBytes.create(pushSize));
}
Contributor

Do we not need similar behavior if code.length <= copyStart? Or will the ROM module treat a single zero byte as equivalent to n zero bytes?

Contributor Author

It will be equivalent for us

Contributor

shiftLeft has awful performance for what needs to be done here, just moving bytes around.

There is no arbitrary-length equivalent, but use a construct similar to Bytes32.leftPad() - https://github.com/tmio/tuweni/blob/aeeb06bc12d4f21a725c74ec86e48396b7f23581/bytes/src/main/java/org/apache/tuweni/bytes/Bytes32.java#L152-L154 - except make the MutableBytes pushSize in length.
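
For illustration, a right-pad analogue of that construct might look like the following (a sketch only; rightPad is a hypothetical helper, not an existing Tuweni method):

import org.apache.tuweni.bytes.Bytes;
import org.apache.tuweni.bytes.MutableBytes;

// Hypothetical helper: return `value` padded on the right with 0x00 up to targetSize.
static Bytes rightPad(final Bytes value, final int targetSize) {
  if (value.size() >= targetSize) {
    return value;
  }
  final MutableBytes result = MutableBytes.create(targetSize); // zero-filled on creation
  value.copyTo(result, 0);                                     // copy into the left end; the tail stays 0x00
  return result;
}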

Contributor

@lu-pinto Oct 31, 2024

This will definitely add something to the execution time. My preference would be not to change it at all, since there's no consensus issue and PUSH is quite hot. It's either extra object creation with Bytes or resizing the array backing the stack implementation. We might think this is less frequent than it actually is, but I could see code being legitimately compacted by a compiler for storage purposes. We could choose to handle it as a kind of virtual PUSHN without changing the stack. Could we back this out at some point?
I.e. I see this as an implementation detail, since we could, for instance, actually use stack compaction to optimise things.

Anyhow, if we still want to proceed, the best way would be to create a Bytes object with a MutableBytes initialised to 0 and then use set to copy the value onto it. This prevents the creation of extra objects here.
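
As a sketch of that suggestion (illustrative only; the method name and parameters below are assumptions, reusing the variable names from the snippet under review):

// Build the padded push value from a zero-initialised MutableBytes and a single set().
static Bytes paddedPush(final byte[] byteCode, final int copyStart, final int copyLength, final int pushSize) {
  final MutableBytes padded = MutableBytes.create(pushSize);   // zero-initialised to the full push size
  padded.set(0, Bytes.wrap(byteCode, copyStart, copyLength));  // copy in the bytes that are actually available
  return padded;                                               // the trailing bytes remain 0x00
}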

Contributor Author

Here is what we do in Linea's project when we want to right pad, because the Tuweni library doesn't have an easy left/right padding helper: https://github.com/Consensys/linea-tracer/blob/624f9cf971c4ae2a8089c41696fdec98f520a411/arithmetization/src/main/java/net/consensys/linea/zktracer/types/Utils.java#L48-L51

Would you be OK with something like this? If yes, where should I put this Bytes library extension method?

Contributor

@shemnon Oct 31, 2024

The Tuweni Bytes package should just be brought back into Besu (it started there back when Besu was called Pantheon). That way, when we need to add something such as right pad, we do it directly on the MutableBytes object instead of copying implementations into random places.

It also allows for a tighter optimization loop when fixing performance issues.

For example, the best API would be copy(byte[] data, int offset, int len), and in the implementation we would use System.arraycopy to do the copy, since it has access to the array and the HotSpot compiler has an intrinsic for it.
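
As a rough sketch of the shape being proposed (purely illustrative; neither this class nor the method exists in Tuweni today):

// Hypothetical array-backed mutable type exposing the proposed copy(byte[], int, int).
final class ArrayBackedMutableBytes {
  private final byte[] backing;

  ArrayBackedMutableBytes(final int size) {
    this.backing = new byte[size]; // zero-filled, so any bytes not copied stay 0x00
  }

  // Copy len bytes from data[offset..offset+len) into the start of this value.
  // System.arraycopy has direct access to the backing array and is intrinsified by HotSpot.
  void copy(final byte[] data, final int offset, final int len) {
    System.arraycopy(data, offset, backing, 0, len);
  }
}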

Contributor

@shemnon Oct 31, 2024

Why would an optimized compiler end on an incomplete PUSH? The same code without the final incomplete PUSH would be functionally equivalent and more optimized, unless the purpose is to burn 3 gas on the final instruction ^^ The point being that the value being pushed cannot be used.

An optimized compiler wouldn't, but a troll would. The traces would be different across different major EVM implementations, which is why obsessive attention to the spec is needed. This is the sort of defense in depth that will pay silent dividends.

@OlivierBBB Oct 31, 2024

Sure. "Obsessive attention to the spec" is what led @letypequividelespoubelles, @lorenzo.gentile and myself to write tests with pathological bytecode and to detect this deviation from the Yellow paper in the first place.

Contributor

@ahamlat Nov 1, 2024

For example, the best API would be copy(byte[] data, int offset, int len) and in the impl we would use System.arrayCopy to do the copy since it has access to the array, and the hotspot compiler has an intrinsic for it.

This is how copyTo currently works in ArrayWrappingBytes, but in the other direction.

Contributor

@lu-pinto Nov 3, 2024

The Tuweni Bytes package should just be brought back into Besu (it started there when it was called pantheon). That way when we need to add something such as right pad we do it directly on the MutableBytes object, instead of copying implementations into random places.

Indeed, +1 on bringing Tuweni into Besu, or at least having the same maintainers as Besu so we can all make updates to it. Having the code modularised is not bad either, TBH. This framework sits at the core of the Besu client; I don't understand how it is close to impossible to modify. It's not even modernized anymore; it still builds with Java 8 or 11.

For example, the best API would be copy(byte[] data, int offset, int len) and in the impl we would use System.arrayCopy to do the copy since it has access to the array, and the hotspot compiler has an intrinsic for it.

There's MutableArrayWrappingBytes.set(int, Bytes), which uses System.arraycopy, but no MutableArrayWrappingBytes.set(int, byte[]) or similar. I don't see how a copy method would help, because in the end you have to return Bytes, so that's another object to create on return, unless you are referring to MutableBytes, in which case set would be the most fitting name.

What I think we need to do is work a lot more with MutableBytes and less with Bytes (only work with Bytes if you are storing it as a final field in a class, i.e. it represents data and not state), and we need to make some API changes in Tuweni to motivate developers to do this.

Contributor

@garyschulte left a comment

A unit test assertion is a nice-to-have, but approving to unblock.

Contributor

@shemnon commented Oct 31, 2024

This must have unit tests and it should have some performance assessment done; were I still a maintainer, I would block this until they were present.

Push is a particularly "hot" instruction and should be handled with care. You can easily nuke performance getting it wrong.

Contributor Author

@letypequividelespoubelles

Push is a particularly "hot" instruction and should be handled with care. You can easily nuke performance getting it wrong.

I do agree with this. As a reminder, we may need to right pad the push value at most once per transaction: when we need to right pad, it means there is no more bytecode available, which means the next instruction will be a STOP, and thus the end of the execution.

Signed-off-by: F Bojarski <ceciestunepoubelle@protonmail.ch>
Contributor

@matkt commented Oct 31, 2024

Just pinging @ahamlat as we are talking about performance impact.

@OlivierBBB

I remind that we may need to right pad the push value at most once per transaction

To be fair: at most once per transaction or internal transaction (i.e. when executing a CALL or a CREATE).

Contributor

@garyschulte

This must have unit tests and it should have some performance assessment done; were I still a maintainer, I would block this until they were present.

Push is a particularly "hot" instruction and should be handled with care. You can easily nuke performance getting it wrong.

Thanks for raising the shift-left concern. Bringing Bytes back into Besu is likely the right solution. In the interim, we should quantify the performance cost of the various options: shift left, Bytes.concatenate, etc.

We are technically out of spec now and would have different traces than Geth in this case. But thankfully it is not a consensus issue.

Contributor

@ahamlat commented Nov 1, 2024

I have been running this PR on two instances for one day, and the commit before on two other instances to test real current blocks on mainnet. I don't see a difference in terms of performance in block processing.

[image: block processing performance comparison between the two sets of instances]

I also created a custom state test and haven't seen a difference in performance, which is a bit weird, as the test consists of repeating PUSH32 0xff on the stack. @shemnon could you check if this test file jd_attack_london_40MGas_PUSH32_250M.json is valid?

The key metric here is Mgps, and both have around 92 Mgas/s.

This PR

{"output":"","gasUsed":"0x2625a00","time":430853614,"Mgps":"92.839","test":"jd_attack_london_40MGas_PUSH32_250M","fork":"London","d":0,"g":0,"v":0,"postHash":"0x098ee441971f62b1a8db971d8d98491fa97a69217f41702d410a146e700248f6","postLogsHash":"0x1dcc4de8dec75d7aab85b567b6ccd41ad312451b948a7413f0a142fd40d49347","pass":false,"error":"INSUFFICIENT_GAS"}

The commit before

{"output":"","gasUsed":"0x2625a00","time":434550022,"Mgps":"92.049","test":"jd_attack_london_40MGas_PUSH32_250M","fork":"London","d":0,"g":0,"v":0,"postHash":"0x098ee441971f62b1a8db971d8d98491fa97a69217f41702d410a146e700248f6","postLogsHash":"0x1dcc4de8dec75d7aab85b567b6ccd41ad312451b948a7413f0a142fd40d49347","pass":false,"error":"INSUFFICIENT_GAS"}

gas-benchmarks doesn't have a performance test for Push1-32, only for Push0, which is not impacted by this PR.

Contributor

@ahamlat commented Nov 1, 2024

Please don't merge the PR yet, I have to do more tests

@OlivierBBB commented Nov 1, 2024

@ahamlat Your test vector won't trigger this behaviour. Looking at the code, you have a great number of repetitions of

code: "0x 7f ff 7f ff 7f ff 7f ff 7f ff 7f ff ..."

But those aren't incomplete PUSH32 instructions: the code doesn't run out. In fact, what you should see when executing this code is that you execute precisely two instructions, namely a PUSH32 followed by a SELFDESTRUCT (which has byte value ff):

PUSH32 0x ff 7f ff 7f ff 7f ff 7f ff 7f ff 7f ff 7f ff 7f ff 7f ff 7f ff 7f ff 7f ff 7f ff 7f ff 7f ff 7f
SELFDESTRUCT

The main point is that

The only time this "incomplete push" behaviour can be triggered is if the incomplete push is at the tail end of the bytecode, seen as a string of bytes.

How to trigger the Java code

To test this code properly you would have to produce e.g. some bytecode that would otherwise terminate on a STOP (either explicitly in the bytecode or implicitly through bytecode padding) and replace it with an incomplete push.

Minimal, most antagonistic, and performance-degrading test

The most radical case is a tiny bytecode comprised of k + 1 bytes that looks like this:

code: "0x [PUSHX] ff ff ff ... ff"

with 0 ≤ k < X repetitions of ff, where [PUSHX] = 0x5f + X is the byte value of PUSHX. This is the case where the bytecode is comprised of a single incomplete PUSH instruction.
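
For instance (a concrete illustration of this pattern, not taken from the linked test suite): with X = 2 and k = 1 the bytecode would be

code: "0x 61 ff"

i.e. a single PUSH2 with only one data byte available; per the spec the value pushed must be 0xff00.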

More realistic, less antagonistic test

Another, less adversarial case would be, e.g., real bytecode à la

code: "0x 60 00 60 80 52 ... [some actual byte code] ... XX YY ZZ (00)"

Here the final instruction, the STOP following the ZZ, which may have been omitted by the compiler since it exists implicitly through right 00 padding, gets executed on a normal run of the bytecode. One would then replace that (possibly implicit) 00 with, e.g., a PUSHX followed by 0 ≤ k < X repetitions of ff:

code: "0x 60 00 60 80 52 ... [some actual byte code] ... XX YY ZZ [PUSHX] ff ff ... ff"

Existing test vectors

We have written such tests as described in Consensys/linea-tracer#1412. These are the test families that blew up our arithmetization, more specifically the lookup HUB -> ROM for instruction fetching (which extracts the PUSH_VALUE at the same time as the opcode).

Contributor

@ahamlat commented Nov 1, 2024

Thanks @OlivierBBB, you're right, PUSH32 will take the next 32 bytes as input. This explains why the results are the same in terms of performance. Let me review my test. I will do microbenchmarking tests and share the results.

Contributor

@ahamlat commented Nov 2, 2024

I isolated the code that creates the push bytes (without the message frame) and evaluated the overhead compared to the current version. I created an initial bytecode of 32 bytes and used different values for PC (28 or 29) and for push size (3 or 32):

Benchmark                                    (PC)  (pushSize)  Mode  Cnt   Score   Error  Units
PushOperationBenchmark.benchmarkOldPushImpl    28           3  avgt   50   2.883 ± 0.132  ns/op
PushOperationBenchmark.benchmarkOldPushImpl    28          32  avgt   50   2.840 ± 0.056  ns/op
PushOperationBenchmark.benchmarkOldPushImpl    29           3  avgt   50   2.873 ± 0.048  ns/op
PushOperationBenchmark.benchmarkOldPushImpl    29          32  avgt   50   2.914 ± 0.079  ns/op
PushOperationBenchmark.benchmarkNewPushImpl    28           3  avgt   50   4.510 ± 0.268  ns/op
PushOperationBenchmark.benchmarkNewPushImpl    28          32  avgt   50  24.007 ± 0.426  ns/op
PushOperationBenchmark.benchmarkNewPushImpl    29           3  avgt   50  16.652 ± 0.249  ns/op
PushOperationBenchmark.benchmarkNewPushImpl    29          32  avgt   50  23.313 ± 1.332  ns/op

The push bytes are created in around 2.9 ns for push3 and push32 with the main implementation (before this PR). The cost is also the same for the tested cases where padding is necessary (as it is not done currently on main).

The overhead without padding, which is the most important in our case as it affects all the PUSHx opcodes, is around 1.6 ns (4.5 ns - 2.9 ns). This means that 625,000 PUSHx opcodes will have an overhead of 1 ms. Theoretically, with a 30 Mgas block size, we can have up to 10 million PUSHx operations, and such a block would see an overhead of 16 ms. With that in mind, 99% of blocks contain fewer than 625,000 PUSHx operations, resulting in an overhead of less than 1 ms.

This is represented by this test, as there is no padding in this case:

PushOperationBenchmark.benchmarkNewPushImpl    28           3  avgt   50   4.510 ± 0.268  ns/op

The worst case is PUSH32 with padding, with an overhead of 21 ns per execution (24 ns - 2.9 ns). I'm not sure what the worst case is in terms of the number of PUSH32s with padding we can have in a 30 Mgas block. I know that we can have one per transaction, but each transaction can have n CREATEs and m CALLs, and for each of these CREATEs and CALLs we can have a padded PUSH32.

@letypequividelespoubelles @OlivierBBB any idea of the maximum number of PUSH32s with padding that we can have in a 30 Mgas block?

@OlivierBBB commented Nov 2, 2024

You could have a smart contract like so

JUMPDEST
PUSH1 0x00
PUSH1 0x00
PUSH1 0x00
PUSH1 0x00
PUSH20 <address of a SMC whose bytecode is a single byte = 0x7f = PUSH32>
PUSH1 0x03
STATICCALL
POP
PUSH1 0x00
JUMP

This performs these incomplete PUSHes in a loop. Every iteration costs 2 + 3*6 + 100 + 2 + 3 + 8 = 133 gas, plus 3 gas for the callee's PUSH32 (the first iteration costs an extra 2,500 since the target will be cold), so you can do this n times as long as 21,000 + 2,500 + n*136 <= 30,000,000, which is a lot.

Here the bytecode of the target would only have size 1 and contain the single byte 0x7f.

You could theoretically play this game 220,415 times

Contributor

@ahamlat commented Nov 2, 2024

I tested this implementation and the results look much better for the cases where padding is applied:

@Benchmark
public void benchmarkNewPushWithMutableByteImpl(final Blackhole bh) {
    int copyStart = PC + 1;
    Bytes push;
    if (byteCode.length <= copyStart) {
        // The push data starts past the end of the code: push an empty value.
        push = Bytes.EMPTY;
    } else {
        final int copyLength = Math.min(pushSize, byteCode.length - PC - 1);
        final int rightPad = pushSize - copyLength;
        if (rightPad == 0) {
            // Enough bytes available: wrap the code directly, no copy needed.
            push = Bytes.wrap(byteCode, copyStart, copyLength);
        } else {
            // Not enough bytes: copy what is available into a zero-filled array of pushSize.
            var bytecodeLocal = new byte[pushSize];
            System.arraycopy(byteCode, copyStart, bytecodeLocal, 0, copyLength);
            // (This second arraycopy turns out to be redundant, since the new array is
            // already zero-filled; see the follow-up comment below.)
            System.arraycopy(new byte[rightPad], 0, bytecodeLocal, copyLength, rightPad);
            push = Bytes.wrap(bytecodeLocal);
        }
    }
    bh.consume(push);
}

The benchmark results

Benchmark                                                   (PC)  (pushSize)  Mode  Cnt   Score   Error  Units
PushOperationBenchmark.benchmarkNewPushWithMutableByteImpl    28           3  avgt   50   4.395 ± 0.277  ns/op
PushOperationBenchmark.benchmarkNewPushWithMutableByteImpl    28          32  avgt   50  12.944 ± 0.432  ns/op
PushOperationBenchmark.benchmarkNewPushWithMutableByteImpl    29           3  avgt   50  12.409 ± 0.283  ns/op
PushOperationBenchmark.benchmarkNewPushWithMutableByteImpl    29          32  avgt   50  12.828 ± 0.425  ns/op

With this implementation, the overhead on all PUSHx is still the same, around 1.5 ns, or 1 ms for 625,000 PUSHx operations, because the current implementation's baseline is 2.9 ns.
The overhead in the worst case with padding is around 10 ns per execution (12.8 ns - 2.9 ns); following what @OlivierBBB shared, the total overhead is 220,415 * 10 ns = 2.2 ms, which is an acceptable overhead for the worst case.

Signed-off-by: F Bojarski <ceciestunepoubelle@protonmail.ch>
Contributor

@ahamlat commented Nov 4, 2024

@letypequividelespoubelles noticed that this line is not necessary

System.arraycopy(new byte[rightPad], 0, bytecodeLocal, copyLength, rightPad);

which is actually true, because var bytecodeLocal = new byte[pushSize]; already initializes the padding bytes to 0.
I launched the microbenchmark, and the overhead is even lower when padding is performed, around 5 ns per PUSHx execution.

Benchmark                                                   (PC)  (pushSize)  Mode  Cnt  Score   Error  Units
PushOperationBenchmark.benchmarkNewPushWithMutableByteImpl    28           3  avgt   50  4.306 ± 0.058  ns/op
PushOperationBenchmark.benchmarkNewPushWithMutableByteImpl    28          32  avgt   50  8.060 ± 0.204  ns/op
PushOperationBenchmark.benchmarkNewPushWithMutableByteImpl    29           3  avgt   50  7.334 ± 0.227  ns/op
PushOperationBenchmark.benchmarkNewPushWithMutableByteImpl    29          32  avgt   50  7.874 ± 0.114  ns/op
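
For reference, the padding branch then reduces to something like this (a sketch based on the benchmark code above, not necessarily the exact merged diff):

// Not enough bytes left in the code: copy what is available into a zero-filled array of
// pushSize; the trailing bytes are already 0x00, so no explicit padding copy is needed.
var bytecodeLocal = new byte[pushSize];
System.arraycopy(byteCode, copyStart, bytecodeLocal, 0, copyLength);
push = Bytes.wrap(bytecodeLocal);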

@shemnon could you review again?

Contributor

@shemnon commented Nov 4, 2024

Tests are present and the code has been adjusted based on benchmarking. My concerns are satisfied.

@garyschulte garyschulte enabled auto-merge (squash) November 4, 2024 23:29
@garyschulte garyschulte merged commit ef9d1ab into hyperledger:main Nov 4, 2024
43 checks passed
JanetMo pushed a commit to JanetMo/besu that referenced this pull request Nov 17, 2024
fix(push): right pad if push parameter + PC +1 exceeds code length (hyperledger#7834)

* fix(push): right pad if push parameter exceeds code length
* test: add push operation padding tests
* better perf rightPadding thnks to Ameziane

Signed-off-by: F Bojarski <ceciestunepoubelle@protonmail.ch>
Co-authored-by: Ameziane H. <ameziane.hamlat@consensys.net>
Signed-off-by: Marlene Marz <m.marz@kabelmail.de>

7 participants