[BUG] [op-main] Gravity batch is stuck in redelivery #959

Closed · verabehr opened this issue Nov 6, 2023 · 16 comments
Labels: bug (Something isn't working), Skyway

verabehr commented Nov 6, 2023

No description provided.

taariq added the bug (Something isn't working) and Skyway labels on Nov 6, 2023
@byte-bandit

The problem seems to stem from the use of multiple nonces throughout the gravity module. Most importantly, there are two different kinds of nonces: BatchNonce and EventNonce. While the former identifies a specific version of the batch being relayed, the latter is a bit more nebulous.

The design appears to originate from the original Gravity contract, which would emit a nonce as part of its observation events. Our contract doesn't do that. Instead, each Pigeon uses its own dedicated nonce for relaying gravity batches, and only for that.

In addition, Paloma keeps track of the latest observed event nonce it received from the Pigeons. That would be fine if there were a universal source of truth, i.e. if the nonce originated directly from the smart contract. But for us it doesn't; it's just an arbitrary counter that's incremented every time a Pigeon calls in with a gravity update.

Since Paloma tracks the latest nonces, it assumes that anything older is a replay of an already observed event and discards it. However, there seems to be no reconciliation or syncing between nonces at all. That means that if Pigeons don't relay and observe, e.g. because they're jailed, offline, or having RPC issues, they might not see certain batches on the target chain. Since they're not around to attest to those events, they won't increase their nonces.

The more often this happens, the more the nonces drift out of sync. Take a look at this example block:
https://testnet.paloma.explorers.guru/block/7521077

It contains two transactions, both of which are attestation messages for an observed event. The observed event is identical (notice the identical batch ID and ETH block height), but the event nonces are almost 100 apart. This means Paloma will not treat them as attestations to the same event; instead, it will likely ignore them entirely, as the latest event nonce at that time was already well over 1000.
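
To make the failure mode concrete, here is a toy model of the bookkeeping described above (all type, field and method names are illustrative, not Paloma's actual keeper API, and the numbers are made up). Because attestations are bucketed by event nonce and anything at or below the latest observed nonce is dropped as a replay, two Pigeons reporting the identical batch under different, lagging nonces never converge on the same attestation:

```go
package main

import "fmt"

// Illustrative model only; Paloma's real gravity/skyway keeper differs.
type BatchEvent struct {
	EventNonce uint64 // per-Pigeon counter, not emitted by the target contract
	BatchID    uint64
	EthHeight  uint64
}

type Chain struct {
	latestObservedNonce uint64
	// Attestations are bucketed by event nonce, so the same batch reported
	// under two different nonces lands in two different buckets.
	attestations map[uint64][]string
}

func (c *Chain) Attest(pigeon string, ev BatchEvent) {
	if ev.EventNonce <= c.latestObservedNonce {
		fmt.Printf("%s: nonce %d <= latest %d, treated as replay and discarded\n",
			pigeon, ev.EventNonce, c.latestObservedNonce)
		return
	}
	c.attestations[ev.EventNonce] = append(c.attestations[ev.EventNonce], pigeon)
}

func main() {
	c := &Chain{latestObservedNonce: 1000, attestations: map[uint64][]string{}}

	// Two Pigeons observe the identical batch, but their nonces are ~100 apart
	// and both below the chain's latest observed nonce: neither attestation counts.
	c.Attest("pigeon-a", BatchEvent{EventNonce: 905, BatchID: 42, EthHeight: 111872000})
	c.Attest("pigeon-b", BatchEvent{EventNonce: 998, BatchID: 42, EthHeight: 111872000})
}
```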


Proposed solution

This needs some brainstorming, but I would suggest that the nonce in question be synced against the latest observed nonce every so often.
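
One possible shape for that sync, extending the toy Chain model from the sketch above (purely illustrative; the interval and where the hook would live, Paloma end-blocker vs. Pigeon, are open questions):

```go
// SyncPigeonNonces fast-forwards any lagging Pigeon to the chain's latest
// observed nonce, so its next attestation isn't silently discarded as a
// replay of an already observed event. Sketch only, not an agreed design.
func (c *Chain) SyncPigeonNonces(pigeonNonces map[string]uint64) {
	for pigeon, nonce := range pigeonNonces {
		if nonce < c.latestObservedNonce {
			pigeonNonces[pigeon] = c.latestObservedNonce
		}
	}
}
```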

@verabehr
Author

@byte-bandit once #996 is complete, what are the outstanding items required on Pigeon (and potentially Paloma)?

verabehr commented Nov 14, 2023

At a minimum: Pigeon will need to change to look for the updated event that includes the event nonce.
TBD: Paloma changes, depending on whether the data from the new event is populated everywhere it should be and whether agreement on the event nonce can be achieved.

byte-bandit changed the title from "Bug: optimism batch message is not getting removed" to "[BUG] [op-main] Gravity batch is stuck in redelivery" on Nov 15, 2023

taariq commented Nov 28, 2023

@byte-bandit what are the next steps for this ticket to close?

@byte-bandit

@taariq After we release 1.10.1 and the network has recovered with most validators being unjailed, we need to confirm that the batch is indeed being removed.

I will do that as part of recovering the testnet. Once that's done, I will update and close this ticket.

@byte-bandit

@taariq Batch is still stuck because it hasn't yet been redelivered since the compass was upgraded: https://optimistic.etherscan.io/address/0xC137433e767bdDE473511B84df834e5D13389015

Most validators, ourselves included, are likely lacking the funds.

@byte-bandit

There's also this one that might prevent Pigeons from seeing the event in question: #1023 (though it might reach optimistic consistency and just cause a lot of log noise)

@byte-bandit

The batch is still not removed, but there has also been no relay attempt yet. Going to keep an eye on the assignment.

@byte-bandit

@taariq @verabehr

So, there's another bug in the gravity bridge that became apparent now that we're double-checking the target chain's valset ID before we send a transaction to it.

For SLC, we always create a JIT valset update along with the SLC call and make sure that is processed successfully BEFORE we publish the SLC.

For gravity batches, this doesn't happen (yet). We will need to change how messages are emitted and handled on the Paloma side for this.

I think we can do it similarly to the SLC solution: publish a JIT valset update alongside the batch, then "hide" any outgoing batches from relayers until the chain no longer has pending valset update messages in the queue.
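
A rough sketch of that gating, with made-up type and field names (this is not the current keeper API, just the proposed ordering):

```go
// Illustrative fragment only; names do not match Paloma's actual code.
type OutgoingBatch struct {
	BatchNonce uint64
	ChainID    string
}

type BatchKeeper struct {
	pendingValsetUpdates map[string]int             // chain ID -> queued valset updates
	outgoingBatches      map[string][]OutgoingBatch // chain ID -> batches awaiting relay
}

// VisibleBatchesForRelay mirrors the SLC flow: publish the JIT valset update
// first, and keep batches hidden from relayers until the target chain has no
// valset update left in the queue, so a batch is never relayed against a
// stale valset ID.
func (k *BatchKeeper) VisibleBatchesForRelay(chainID string) []OutgoingBatch {
	if k.pendingValsetUpdates[chainID] > 0 {
		return nil
	}
	return k.outgoingBatches[chainID]
}
```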

The change shouldn't be large, but I haven't worked with that side of the code a lot recently and will need to familiarise myself with it again.

taariq commented Dec 1, 2023

Thanks for the update on this. Okay, to reflect: a new bug in the Gravity Bridge has been detected.

  1. We create JIT valsets with the SLC call to make sure that we are able to do the SLC.
  2. We did not create that feature for Gravity Batch calls.
  3. We need to make this update.

Did I get that right? Agreed and aligned. @verabehr, will you create a ticket for this and put it in TO-DO under Gravity? We'll get to it after Pigeon Feed.

@byte-bandit any objections to closing this ticket?

@byte-bandit

@taariq The batch is still stuck. I'd leave this open for now; maybe move it to Blocked until we solve the bug I outlined.

Also, we're going to need to address this before or alongside Pigeon Feed, as that will, I think, require a working bridge.

taariq commented Dec 1, 2023

  1. Blocked is totally doable.
  2. Let's leave Gravity Pigeon Feed upgrades until after we get SLC payments underway. This is the same flow as how we got to the Gravity Bridge after SLC was started and in motion on Paloma. The bridge is less important than the relay economy, as we are unable to scale throughput without pre-payments.

taariq commented Mar 1, 2024

@byte-bandit moving this back into Progress while we wait on the compass upgrade.

taariq commented Mar 7, 2024

@byte-bandit status on this?

@byte-bandit

I have not started looking into this again; I will update once I finish the attestation enforcement epic.

@byte-bandit

Closing this, as it's no longer relevant with the new testnet.
