-
Notifications
You must be signed in to change notification settings - Fork 871
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(push): right pad if push parameter + PC +1
exceeds code length
#7834
Conversation
Signed-off-by: F Bojarski <ceciestunepoubelle@protonmail.ch>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be nice to have a unit test that asserts this behavior since it has been technically out of spec for some time.
// Right Pad the push with 0s up to pushSize if greater than the copyLength | ||
final int rightPad = pushSize - copyLength; | ||
if (rightPad != 0) { | ||
push = push.shiftLeft(rightPad * 8, MutableBytes.create(pushSize)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we not need a similar behavior if code.length <= copyStart? or will the ROM module treat a single zero byte as equivalent to n
zero bytes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It will be equivalent for us
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shiftLeft has awful performance for what needs to be done here, just moving bytes around.
There is no arbitrary-lengh equivelant, but use a construct siliar to Bytes32.leftPad() - https://github.com/tmio/tuweni/blob/aeeb06bc12d4f21a725c74ec86e48396b7f23581/bytes/src/main/java/org/apache/tuweni/bytes/Bytes32.java#L152-L154 - except make the mutableBytes pushSize length.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will definitely add something to the execution time. My preference would be to not change it at all since there's no consensus issue and PUSH is quite hot. It's either extra object creation with Bytes or array resize backing the stack implementation. We might think this is less frequent than it actually is but I could see code being compacted by compiler legitemately for storage purposes. We could choose to handle it like a kind of virtual pushN without changing the stack. Could we back this out at some point?
I.e. I see this as an implementation detail as we could have actually stack compaction for instance to optimise things.
Anyhow if we still want to proceed the best way would be to create a Bytes object with MutableBytes initialised to 0 and then use 'set' to add the value on it. This prevents creation of extra objects here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is what we do in Linea's project when we want to rightPad, because Tuweni library doesn't have an easy left/right padding: https://github.com/Consensys/linea-tracer/blob/624f9cf971c4ae2a8089c41696fdec98f520a411/arithmetization/src/main/java/net/consensys/linea/zktracer/types/Utils.java#L48-L51
Would you be ok with something like this ? If yes, where should I write this Bytes
library extension method ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Tuweni Bytes package should just be brought back into Besu (it started there when it was called pantheon). That way when we need to add something such as right pad we do it directly on the MutableBytes object, instead of copying implementations into random places.
It also allows for a tighter optimization loop when fixing performance issues.
For example, the best API would be copy(byte[] data, int offset, int len)
and in the impl we would use System.arrayCopy
to do the copy since it has access to the array, and the hotspot compiler has an intrinsic for it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why would an optimized compiler end on an incomplete
PUSH
? The same code without the final incompletePUSH
would be functionally equivalent and more optimized, unless the purpose is to burn 3 gas on the final instruction ^^ The point being that the value being pushed cannot be used.
An optimized compiler wouldn't, but a troll would. The traces would be different against different major EVM implementations, which is why obsessive attention to the spec is needed. This is the sorts of defense in depth that will pay silent dividends.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. "Obsessive attention to the spec" is what led @letypequividelespoubelles, @lorenzo.gentile and myself to write tests with pathological bytecode and to detect this deviation from the Yellow paper in the first place.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For example, the best API would be copy(byte[] data, int offset, int len) and in the impl we would use System.arrayCopy to do the copy since it has access to the array, and the hotspot compiler has an intrinsic for it.
This is how copyTo works currently in ArrayWrappingBytes, but in the other sens
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Tuweni Bytes package should just be brought back into Besu (it started there when it was called pantheon). That way when we need to add something such as right pad we do it directly on the MutableBytes object, instead of copying implementations into random places.
indeed +1 on bringing tuweni into besu or at least have same maintainers as in besu so we can all make updates to it. Having code modularised is also not bad though, TBH. This framework sits at the core of the besu client, I don't understand how is it close to impossible to modify it. It's not even modernized anymore, still builds with java 8 or 11.
For example, the best API would be copy(byte[] data, int offset, int len) and in the impl we would use System.arrayCopy to do the copy since it has access to the array, and the hotspot compiler has an intrinsic for it.
There's MutableArrayWrappingBytes.set(int, Bytes)
which uses System.arraycopy
but no MutableArrayWrappingBytes.set(int, byte[])
or similar. I don't see how a copy
method would help because in the end you have to return Bytes
so it's another object to create when you return, unless you are referring to MutableBytes
, but set
would be the most fitting name.
What I think we need to do is work a lot more with MutableBytes
and less with Bytes
- only work with Bytes
if you are storing it as a final field in a class (i.e. it represents data and not state) - and we need to make some API changes in tuweni to motivate developers to do this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unit test assertion is a nice to have, but approving to unblock.
This must have a unit tests and it should have some performance assessment done, and were I still a maintainer I would block this until they were present. Push is a particularly "hot" instruction and should be handled with care. You can easily nuke performance getting it wrong. |
I do agree on this. I remind that we may need to right pad the push value at most once per transaction: when we need to right pad, it means we have no more byte code available, which means the next instruction will be a |
Signed-off-by: F Bojarski <ceciestunepoubelle@protonmail.ch>
just ping @ahamlat as we are talking about performance impact |
To be fair: at most once per transaction or internal transaction (i.e. when executing a |
Thanks for raising the shift left concern. Bringing Bytes back into besu is likely the right solution. In the interim, we should quantify the performance cost of various options. Shift left, Bytes.concatenate, etc. We are technically out of spec now and would have different traces than geth in this case. But thankfully it is not a consensus issue. |
I have been running this PR on two instances for one day, and the commit before on two other instances to test real current blocks on mainnet. I don't see a difference in terms of performance in block processing. I also created a custom state test, and haven't seen a difference in the performances which a bit weird as the test consists of repeating They key metric here mgps and both have around 92 mgas/s. This PR
The commit before
gas-benchmarks doesn't have a performance test for Push1-32, only for Push0 which is not impacted by this PR. |
Please don't merge the PR yet, I have to do more tests |
@ahamlat Your test vector won't trigger this behaviour. Looking at the code you have a great number of repetitions of code: "0x 7f ff 7f ff 7f ff 7f ff 7f ff 7f ff ..." But those aren't incomplete PUSH32 0x ff 7f ff 7f ff 7f ff 7f ff 7f ff 7f ff 7f ff 7f ff 7f ff 7f ff 7f ff 7f ff 7f ff 7f ff 7f ff 7f
SELFDESTRUCT The main point is that
How to trigger the java codeTo test this code properly you would have to produce e.g. some bytecode that would otherwise terminate on a STOP (either explicitly in the bytecode or implicitly through bytecode padding) and replace it with an incomplete push. Minimal, most antagonistic and performance degrading testThe most radical case being tiny bytecode comprised of code: "0x [PUSHX] ff ff ff ... ff" with More realistic, less antagonistic testAnother, less adversarial case, would be e.g. real byte code à la code: "0x 60 00 60 80 52 ... [some actual byte code] ... XX YY ZZ (00)" Where the final instruction, the code: "0x 60 00 60 80 52 ... [some actual byte code] ... XX YY ZZ [PUSHX] ff ff ... ff" Existing test vectorsWe have written such tests as described in Consensys/linea-tracer#1412. These are the test families that blew up our arithmetization, more specifically the lookup HUB -> ROM for instruction fetching (which extracts the PUSH_VALUE at the same time as the opcode. |
Thanks @OlivierBBB, you're right, push32 with take as input the next 32 bytes. This explains why the results are the same in terms of performance. Let me review my test. I will do microbenchmarking tests, and share the results |
I isolated the code that creates the push bytes (without the message frame) and evaluated the overhead compared to current version. I created an initial bytecode of 32 bytes and have different values for PC (28 or 29), and for push size (3 or 32) :
The push bytes are created in around 2.9 ns for push3 and push32 with main implementation (before this PR). The cost is also the same for the tested cases where a padding is necessary (as it is not done currently on main). The overhead without padding which is the most important in our case, as it should impact all the pushX opcodes, is around 1.6 ns (4.5 ns - 2.9 ns). This means that 625000 pushX opcodes will have an overhead of 1 ms. Theoretically, with 30 mgas block size, we can have up to 10 million pushX operations, and this block would generate an overhead of 16 ms. With that in mind, 99% of blocks will contain fewer than 625,000 PUSHx operations, resulting in an overhead of less than 1 ms. This is represented by this test as there is no padding in this case :
The worst case is on push32 with an overhead of 21 ns per execution with padding (24 ns - 2.9 ns). I'm not sure here what is the worst case in terms of number of push32 we can have with padding on a 30 million block. I know that we can have one per transaction but each transaction can have n create and m calls, and for each of these create and calls, we can have Push32 calls. @letypequividelespoubelles @OlivierBBB any idea on the maximum number of push32 with padding, that we can have on a 30 mgas block ? |
You could have a smart contract like so
Which performs these incomplete PUSH's in a loop. Every loop costs Here the bytecode of the target would only have size 1 and contain the single byte You could theoretically play this game |
I tested this implementation and the results looks much better related to cases where padding is applied :
The benchmark results
With this implementation, the overhead on all PushX is still the same, around 1.5 ns, or 1 ms for 625,000 PUSHx operations, because the current implementation baseline is 2.9 ns. |
Signed-off-by: F Bojarski <ceciestunepoubelle@protonmail.ch>
@letypequividelespoubelles noticed that this line is not necessary
which is actually true because
@shemnon could you review again ? |
Tests are present and the code has been adjusted based on benchmarking. My concerns are satisfied. |
…yperledger#7834) * fix(push): right pad if push parameter exceeds code length * test: add push operation padding tests * better perf rightPadding thnks to Ameziane Signed-off-by: F Bojarski <ceciestunepoubelle@protonmail.ch> Co-authored-by: Ameziane H. <ameziane.hamlat@consensys.net> Signed-off-by: Marlene Marz <m.marz@kabelmail.de>
PR description
The EYP states that the "push value" produced by a
PUSHX
instruction must be constructed as follows: grab X many bytes starting at offsetpc + 1
in the bytecode, where, if you run out of bytecode, you replace these bytes with 00's.This PR fixes the missing right padding with 00's in the (rare) case you run out of bytecode.
Recall that we may need to right pad the push value at most once per transaction: when we need to right pad, it means we have no more byte code available, which means the next instruction will be a
STOP
, thus the end of the execution.Given that the gas cost of a
PUSHX
is constant, and that the following opcode is aSTOP
it means also that this value pushed on the stack has no impact nor on the evm execution nor on the committed change to the state.However it profoundly messes with our (Layer2 zkrollup) Linea's arithmetization processing of
PUSH
instructions , as it completely deviates from the EYP (and the execution-specs, as well as Geth.)FYI @OlivierBBB
Fixed Issue(s)
Thanks for sending a pull request! Have you done the following?
doc-change-required
label to this PR if updates are required.Locally, you can run these tests to catch failures early:
./gradlew build
./gradlew acceptanceTest
./gradlew integrationTest
./gradlew ethereum:referenceTests:referenceTests