[BUG] [op-main] Gravity batch is stuck in redelivery #959

Closed · verabehr opened this issue Nov 6, 2023 · 16 comments
Labels: bug (Something isn't working), Skyway

verabehr commented Nov 6, 2023

No description provided.

taariq added the bug (Something isn't working) and Skyway labels on Nov 6, 2023
@byte-bandit

The problem seems to stem from the use of multiple nonces throughout the gravity module. Most importantly, there are two different kinds of nonces: BatchNonce and EventNonce. While the former identifies a specific version of the batch being relayed, the latter is a bit more nebulous.

The design appears to originate from the original Gravity contract, which would emit a nonce as part of its observation events. Our contract doesn't do that. Instead, each Pigeon uses its own dedicated nonce for relaying gravity batches, and only for that.

In addition, Paloma keeps track of the latest observed event nonce it received from the Pigeons. That would be fine if there were a universal source of truth, i.e. if the nonce originated directly from the smart contract. But for us it doesn't; it's just an arbitrary counter that's incremented every time a Pigeon calls in with a gravity update.

Since Paloma tracks the latest nonces, it assumes that anything older is a replay of an already observed event and discards it. However, there seems to be no reconciliation or syncing between nonces at all. That means that if Pigeons don't relay and observe, e.g. because they're jailed, offline, or having RPC issues, they might not see certain batches on the target chain. Since they're not around to attest to those events, they won't increase their nonces.

The more often this happens, the more the nonces drift out of sync. Take a look at this example block:
https://testnet.paloma.explorers.guru/block/7521077

It contains two transactions, both of which are attestation messages for an observed event. The observed event is identical (notice the identical batch ID and ETH block height), but the event nonces are almost 100 apart. This means Paloma will not treat them as attestations to the same event; instead, it will likely ignore them entirely, as the latest event nonce at that time was already well over 1000.
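
To make the failure mode concrete, here is a toy model of the bookkeeping described above (all type, field and method names are illustrative, not Paloma's actual keeper API, and the numbers are made up). Because attestations are bucketed by event nonce and anything at or below the latest observed nonce is dropped as a replay, two Pigeons reporting the identical batch under different, lagging nonces never converge on the same attestation:

```go
package main

import "fmt"

// Illustrative model only; Paloma's real gravity/skyway keeper differs.
type BatchEvent struct {
	EventNonce uint64 // per-Pigeon counter, not emitted by the target contract
	BatchID    uint64
	EthHeight  uint64
}

type Chain struct {
	latestObservedNonce uint64
	// Attestations are bucketed by event nonce, so the same batch reported
	// under two different nonces lands in two different buckets.
	attestations map[uint64][]string
}

func (c *Chain) Attest(pigeon string, ev BatchEvent) {
	if ev.EventNonce <= c.latestObservedNonce {
		fmt.Printf("%s: nonce %d <= latest %d, treated as replay and discarded\n",
			pigeon, ev.EventNonce, c.latestObservedNonce)
		return
	}
	c.attestations[ev.EventNonce] = append(c.attestations[ev.EventNonce], pigeon)
}

func main() {
	c := &Chain{latestObservedNonce: 1000, attestations: map[uint64][]string{}}

	// Two Pigeons observe the identical batch, but their nonces are ~100 apart
	// and both below the chain's latest observed nonce: neither attestation counts.
	c.Attest("pigeon-a", BatchEvent{EventNonce: 905, BatchID: 42, EthHeight: 111872000})
	c.Attest("pigeon-b", BatchEvent{EventNonce: 998, BatchID: 42, EthHeight: 111872000})
}
```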


Proposed solution

This needs some brainstorming, but I would suggest that the nonce in question be synced against the latest observed nonce every so often.
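
One possible shape for that sync, extending the toy Chain model from the sketch above (purely illustrative; the interval and where the hook would live, Paloma end-blocker vs. Pigeon, are open questions):

```go
// SyncPigeonNonces fast-forwards any lagging Pigeon to the chain's latest
// observed nonce, so its next attestation isn't silently discarded as a
// replay of an already observed event. Sketch only, not an agreed design.
func (c *Chain) SyncPigeonNonces(pigeonNonces map[string]uint64) {
	for pigeon, nonce := range pigeonNonces {
		if nonce < c.latestObservedNonce {
			pigeonNonces[pigeon] = c.latestObservedNonce
		}
	}
}
```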

@verabehr
Author

@byte-bandit once #996 is complete, what are the outstanding items required on Pigeon (and potentially Paloma)?

verabehr commented Nov 14, 2023

At a minimum: Pigeon will need to change to look for the updated event that includes the event nonce.
TBD: Paloma changes, depending on whether the data from the new event is populated everywhere it should be and whether agreement on the event nonce can be achieved.

byte-bandit changed the title from "Bug: optimism batch message is not getting removed" to "[BUG] [op-main] Gravity batch is stuck in redelivery" on Nov 15, 2023

taariq commented Nov 28, 2023

@byte-bandit what are the next steps for this ticket to close?

@byte-bandit

@taariq After we release 1.10.1 and the network has recovered with most validators being unjailed, we need to confirm that the batch is indeed being removed.

I will do that as part of recovering the testnet. Once that's done, I will update and close this ticket.

@byte-bandit

@taariq Batch is still stuck because it hasn't yet been redelivered since the compass was upgraded: https://optimistic.etherscan.io/address/0xC137433e767bdDE473511B84df834e5D13389015

Most validators, ourselves included, are likely lacking the funds.

@byte-bandit

There's also this one that might prevent Pigeons from seeing the event in question: #1023 (though it might reach optimistic consistency and just cause a lot of log noise)

@byte-bandit

The batch is still not removed, but there has also been no relay attempt yet. Going to keep an eye on the assignment.

@byte-bandit

@taariq @verabehr

So, there's another bug in the gravity bridge that became apparent now that we're double-checking the target chain's valset ID before we send a transaction to it.

For SLC, we always create a JIT valset update along with the SLC call and make sure that is processed successfully BEFORE we publish the SLC.

For gravity batches, this doesn't happen (yet). We will need to change how messages are emitted and handled on the Paloma side for this.

I think we can do it similarly to the SLC solution: publish a JIT valset update alongside the batch, then "hide" any outgoing batches from relayers until the chain no longer has pending valset update messages in the queue.
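
A rough sketch of that gating, with made-up type and field names (this is not the current keeper API, just the proposed ordering):

```go
// Illustrative fragment only; names do not match Paloma's actual code.
type OutgoingBatch struct {
	BatchNonce uint64
	ChainID    string
}

type BatchKeeper struct {
	pendingValsetUpdates map[string]int             // chain ID -> queued valset updates
	outgoingBatches      map[string][]OutgoingBatch // chain ID -> batches awaiting relay
}

// VisibleBatchesForRelay mirrors the SLC flow: publish the JIT valset update
// first, and keep batches hidden from relayers until the target chain has no
// valset update left in the queue, so a batch is never relayed against a
// stale valset ID.
func (k *BatchKeeper) VisibleBatchesForRelay(chainID string) []OutgoingBatch {
	if k.pendingValsetUpdates[chainID] > 0 {
		return nil
	}
	return k.outgoingBatches[chainID]
}
```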

The change shouldn't be large, but I haven't worked with that side of the code a lot recently and will need to familiarise myself with it again.

taariq commented Dec 1, 2023

Thanks for the update on this. Okay, to reflect: a new bug in the Gravity Bridge has been detected.

  1. We create JIT valsets with the SLC call to make sure that we are able to do the SLC.
  2. We did not create that feature for Gravity Batch calls.
  3. We need to make this update.

Did I get that right? Agreed and aligned. @verabehr, will you create a ticket for this and put it in TO-DO under Gravity? We'll get to it after Pigeon Feed.

@byte-bandit any objections to closing this ticket?

@byte-bandit

@taariq The batch is still stuck. I'd leave this open for now; maybe move it to Blocked until we solve the bug I outlined.

Also, we're going to need to address this before or alongside Pigeon Feed, as that will, I think, require a working bridge.

taariq commented Dec 1, 2023

  1. Blocked is totally doable.
  2. Let's leave Gravity Pigeon Feed upgrades until after we get SLC payments underway. This is the same flow as how we got to the Gravity Bridge after SLC was started and in motion on Paloma. The bridge is less important than the relay economy, as we are unable to scale throughput without pre-payments.

taariq commented Mar 1, 2024

@byte-bandit moving this back into Progress while we wait on the compass upgrade.

taariq commented Mar 7, 2024

@byte-bandit status on this?

@byte-bandit

I have not started looking into this again; I will update once I finish the attestation enforcement epic.

@byte-bandit

Closing this, as it's no longer relevant with the new testnet.
