[BUG] [op-main] Gravity batch is stuck in redelivery #959
The problem seems to derive from the use of multiple nonces throughout the gravity module. Most importantly, there are two different kinds of nonces.

The design appears to originate from the original Gravity contract, which emits a nonce as part of its observation events. Our contract doesn't do that. Instead, each Pigeon uses its own dedicated nonce for relaying gravity batches, and only for that. In addition, Paloma keeps track of the latest observed event nonce it received from the Pigeons, which would make sense with a universal source of truth where the nonce originates directly from the smart contract. But for us, it doesn't; it's just an arbitrary nonce that is increased every time a Pigeon calls in with a gravity update. Since Paloma keeps track of the latest nonces, it assumes all older nonces are replays of older events and discards them.

However, there appears to be no reconciliation or syncing between nonces at all. This means that if Pigeons don't relay and observe (e.g. because they're jailed, offline, hitting RPC issues, whatever), they might not see certain batches on the target chain. Since they're not around to attest those events, they won't increase their nonce. The more often this happens, the more out of sync the nonces become.

Take a look at this example block: it contains 2 transactions, both attest messages to an observed event. The observed event is identical (notice the identical batch ID and ETH block height), but the event nonces are almost 100 units apart! This means Paloma will not treat them as attestations of the same event. Instead, it will likely ignore them completely, as the event nonce around that time was already well over 1000.

Proposed solution

Needs some brainstorming, but I would suggest that the nonce in question needs to be synced against the latest observed nonce every so often.
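For illustration only, a minimal sketch of what such a periodic reconciliation pass could look like. The `NonceStore` interface and its method names are hypothetical placeholders, not the actual Paloma keeper API; they just model the two pieces of state described above.

```go
package gravity

// NonceStore abstracts the two pieces of state described above: the
// chain-wide latest observed event nonce and each Pigeon's own relay nonce.
// The interface and its method names are hypothetical, for illustration only.
type NonceStore interface {
	LatestObservedEventNonce() uint64
	ValidatorEventNonce(valAddr string) uint64
	SetValidatorEventNonce(valAddr string, nonce uint64)
	Validators() []string
}

// SyncValidatorNonces fast-forwards any validator nonce that has fallen
// behind the latest observed event nonce, so events attested after an
// outage are no longer discarded as replays of older events.
func SyncValidatorNonces(s NonceStore) {
	latest := s.LatestObservedEventNonce()
	for _, val := range s.Validators() {
		if s.ValidatorEventNonce(val) < latest {
			s.SetValidatorEventNonce(val, latest)
		}
	}
}
```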
@byte-bandit once this #996 is complete, what are the outstanding items that are required on pigeon (and potentially paloma)?
At a minimum: pigeon will need to change to look for the updated event that includes the event nonce.
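A rough sketch of that Pigeon-side change, under the assumption that the updated compass event carries the event nonce. `BatchSendEvent`, its fields, and `shouldAttest` are illustrative names, not the actual compass ABI or pigeon code.

```go
package pigeon

// BatchSendEvent is a hypothetical shape of the updated compass event,
// assuming it now carries the event nonce alongside the batch data.
type BatchSendEvent struct {
	BatchID    uint64
	EthBlock   uint64
	EventNonce uint64 // the nonce the Pigeon should now read from the event
}

// shouldAttest reports whether the Pigeon should attest the event, using the
// nonce emitted by the contract rather than a locally maintained counter.
func shouldAttest(ev BatchSendEvent, lastAttestedNonce uint64) bool {
	// Anything at or below the last attested nonce is treated as a replay.
	return ev.EventNonce > lastAttestedNonce
}
```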
@byte-bandit what are the next steps to close this ticket?
@taariq After we release, I will do that as part of recovering the testnet. Once it's done, I will update and close this ticket.
@taariq Batch is still stuck because it hasn't been redelivered since the compass was upgraded: https://optimistic.etherscan.io/address/0xC137433e767bdDE473511B84df834e5D13389015 Most validators are likely lacking the funds, including ourselves.
There's also this one that might prevent Pigeons from seeing the event in question: #1023 (though it might reach optimistic consistency and just cause a lot of log noise)
Batch is still not removed, but no relay has been attempted yet either. Going to keep an eye on the assignment.
So, there's another bug in the gravity bridge that became apparent now that we're double-checking the target chain valset ID before we send a transaction to it. For SLC, we always create a JIT valset update along with the SLC call and make sure it is processed successfully BEFORE we publish the SLC. For gravity batches, this doesn't happen (yet).

We will need to change how messages are emitted and handled on the Paloma side for this. I think we can do it similarly to the SLC solution: publish a JIT update alongside, then "hide" any outgoing batches from relayers until the chain no longer has pending valset update messages in the queue. The change shouldn't be large, but I haven't worked with that side of the code much recently and will need to familiarise myself with it again.
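For reference, a minimal sketch of the "hide batches while a valset update is pending" idea. The `Queue` interface and its method names are hypothetical and do not correspond to the actual Paloma keeper API.

```go
package gravity

// Batch is a stripped-down stand-in for an outgoing gravity batch.
type Batch struct {
	ID      uint64
	ChainID string
}

// Queue is a hypothetical view of the per-chain message queues.
type Queue interface {
	PendingValsetUpdates(chainID string) int
	OutgoingBatches(chainID string) []Batch
}

// RelayableBatches returns the batches a relayer may pick up right now.
// Batches stay hidden while the target chain still has pending valset
// update messages, mirroring how SLC publishes a JIT valset update first.
func RelayableBatches(q Queue, chainID string) []Batch {
	if q.PendingValsetUpdates(chainID) > 0 {
		// The valset update must land on the target chain first; otherwise
		// the batch would be verified against a stale valset ID and fail.
		return nil
	}
	return q.OutgoingBatches(chainID)
}
```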
Thanks for the update on this. Okay to reflect: New bug in Gravity Bridge detected.
Did I get that? Agreed and aligned. @verabehr will you create a ticket for this and put that ticket in TO-DO under Gravity? We'll get to it after Pigeon Feed. @byte-bandit any objections to closing this ticket?
@taariq Batch is still stuck. I'd leave it open for now. Maybe move it to blocked until we solve the bug I outlined. Also, we're going to need to address this before or alongside Pigeon Feed, as I think that will require a working bridge.
@byte-bandit moving this back into Progress while we wait on the compass upgrade.
@byte-bandit status on this?
I have not started looking into this again; I will update once I finish this attestation enforcement epic.
Closing this as it's no longer relevant with the new testnet.