Relay monitoring & preventing continued relay errors #142

metachris · 2022-06-09T11:19:22Z

Once a proposer calls submitBlindedBlock to a relay (with a signed header), it depends on the relay to release the block to be able to propose anything (no fallback to a local block is possible at that point due to possible slashing).

There are several relay error scenarios:

1. payload withholding (relay doesn't release the payload and the proposer needs to forfeit the slot)
2. incorrect payload
   a. incorrect value (the final amount paid by the builder to the proposer was different from the amount claimed in the BuilderBid)
   b. invalid block (invalid data / fields)

Question: How can we shield proposers from faulty relays, and how can we prevent consecutive slots with errors caused by faulty relay behaviour?

A possible solution is a monitoring service run by a trusted third-party, which we can call Relay Monitor (RM).

1. Whenever mev-boost calls submitBlindedBlock to a relay, it also sends a request to the RM, including the SignedBuilderBid, the relay it originated from, and the submitBlindedBlock body.
2. The RM will also request the payload from the relay.
3. Thus the RM can check:
   a. whether the payload is withheld
   b. whether the block matches the bid
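The two checks above can be sketched roughly as follows. This is only an illustration: `Bid`, `Payload`, `Fault` and `check_relay` are hypothetical, heavily simplified stand-ins, not the real builder API types, and the comparison rules (matching block hash, paid value equal to the claimed value) are assumptions about what "matches the bid" would mean.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Fault(Enum):
    NONE = "none"
    PAYLOAD_WITHHELD = "payload_withheld"
    BLOCK_MISMATCH = "block_mismatch"


@dataclass(frozen=True)
class Bid:
    # Hypothetical, simplified stand-in for a SignedBuilderBid.
    relay: str
    block_hash: str
    value_wei: int


@dataclass(frozen=True)
class Payload:
    # Hypothetical, simplified stand-in for the payload a relay returns.
    block_hash: str
    paid_to_proposer_wei: int


def check_relay(bid: Bid, payload: Optional[Payload]) -> Fault:
    """Classify a relay's behaviour for one slot, as seen by the monitor."""
    if payload is None:
        # (a) the relay never released the payload
        return Fault.PAYLOAD_WITHHELD
    if payload.block_hash != bid.block_hash:
        # (b) the released block is not the one that was bid on
        return Fault.BLOCK_MISMATCH
    if payload.paid_to_proposer_wei != bid.value_wei:
        # (b) the amount paid differs from the amount claimed in the bid
        return Fault.BLOCK_MISMATCH
    return Fault.NONE
```

For example, `check_relay(Bid("relay-a", "0xabc", 100), None)` would classify the slot as a withheld payload. Full block validity (invalid data / fields) is not covered by this sketch.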

If there is any problem, the relay's scoring/reputation is updated in the RM and propagated to all connected proposers (by mev-boost polling the relay status endpoint, maybe also push as an option). If any relay behaves incorrectly, all connected proposers can ignore it for some time (reputation mechanism TBD).

This (centralised) service can be put into production quickly, and can mitigate a range of issues resulting from faulty relays. It should be run by a trusted party, and could be replaced in the mid- to longer term with a more decentralized/trustless solution.

TBD:

Reputation mechanics: what exactly happens on a single instance of any of the errors?
Who should run a relay monitor, and how many instances are the sweet spot? There's an argument for having a small number, because (a) the more proposers connect to it the more it knows about relay issues, and (b) it has a lot of "power" in that it can blacklist relays.

Tl;dr: A relay monitor could observe any relay problem a validator experiences, and can tell all the other connected validators about problems with a specific relay. Thus, if a relay causes a problem with one validator, all the other connected validators would immediately know, and could avoid that relay for some time (or whatever mechanic).

metachris · 2022-06-10T07:35:15Z

Here's a rough diagram outlining the setup (src):

lightclient · 2022-06-13T09:53:27Z

Not sure if you saw this doc by Yoav, but I think he has a nice outline for what should be done to keep relays "honest": https://notes.ethereum.org/@yoav/BJeOQ8rI5

A few general thoughts here:

I think the RM should really be multiple actors, otherwise we're not all that much better off than with a relay. Just trusting a different person.
Reputation mechanics -- I think withholding should pretty much be insta-ban. Not sure about invalid block, but also feels like a pretty bad fault.

metachris · 2022-06-13T10:17:49Z

https://notes.ethereum.org/@yoav/BJeOQ8rI5

I think the RM should really be multiple actors, otherwise we're not all that much better off than with a relay. Just trusting a different person.

Good link, it states the problem clearly and hints at a solution based on a committee. It's not yet clear how such a committee would work.

A decentralized setup would definitely be great, and I can see that as a possible next step. It does seem to first require a bunch of work on specification, research and prototyping, to explore the consensus protocol, committee duties and repercussions for malicious behavior.

Reputation mechanics -- I think withholding should pretty much be insta-ban. Not sure about invalid block, but also feels like a pretty bad fault.

Withholding once could actually be a networking issue, I don't think that should be a permaban instantly. Maybe banning for a few hours at first would suffice, and increasing penalties for repeated offenses 🤔
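One way to picture "a few hours at first, increasing for repeated offenses" is a doubling ban with a cap. This is just a sketch: the 2-hour base, the 7-day cap and the doubling rule are all assumed values for illustration, nothing here has been decided in this thread.

```python
from datetime import timedelta

BASE_BAN = timedelta(hours=2)  # assumed length of the first ban
MAX_BAN = timedelta(days=7)    # assumed upper bound, so no permaban
MAX_DOUBLINGS = 16             # keeps the multiplication well within range


def ban_duration(offence_count: int) -> timedelta:
    """Ban length for the Nth offence (1-based), doubling each time up to MAX_BAN."""
    if offence_count < 1:
        return timedelta(0)
    doublings = min(offence_count - 1, MAX_DOUBLINGS)
    return min(BASE_BAN * 2 ** doublings, MAX_BAN)
```

With these numbers a first withheld payload would cost the relay 2 hours, a second 4 hours, a third 8 hours, and so on until the cap is reached.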

terencechain · 2022-06-30T22:29:00Z

Could a relay send different responses to the RM than to mev-boost? I don't see why a relay would do this, but just a thought.

metachris · 2022-07-01T07:11:07Z

I don't think a relay has a reliable way to distinguish mev-boost from the monitor 🤔 The payload is the same, although maybe the request profile over time is different...

MicahZoltu · 2022-07-28T15:58:32Z

I would like to see one minor change, which is that the proposer node can connect to multiple monitors (not just one), and the monitors can connect and talk to each other. While a full gossip network would be ideal, just having a hub and spoke (decentralized) system where clients can connect to multiple hubs is probably "good enough" to buy us time until a more complete gossip network can be set up and secured.
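On the proposer side, connecting to multiple monitors could look something like the sketch below. The function name, the shape of `reports` and the `min_monitors` threshold are hypothetical; how reports from different monitors should actually be weighed is an open question.

```python
from typing import Dict, List, Set


def relays_to_avoid(reports: Dict[str, Set[str]], min_monitors: int = 1) -> List[str]:
    """Combine relay flags from several monitors.

    `reports` maps a monitor name to the set of relays that monitor
    currently flags as faulty. A relay is avoided once at least
    `min_monitors` monitors flag it.
    """
    counts: Dict[str, int] = {}
    for flagged in reports.values():
        for relay in flagged:
            counts[relay] = counts.get(relay, 0) + 1
    return sorted(relay for relay, n in counts.items() if n >= min_monitors)
```

With `min_monitors=1` any single monitor can get a relay ignored (most cautious for the proposer, most power per monitor). Raising the threshold reduces the power of any one monitor at the cost of reacting more slowly.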

ralexstokes · 2022-08-02T23:52:39Z

I don't think a relay has a reliable way to distinguish mev-boost from the monitor 🤔 The payload is the same, although maybe the request profile over time is different...

esp if monitors are "trusted third parties" then they will be well-known entities with fairly fixed IPs, relays could definitely use this to discriminate responses although I don't see how this could be gamed right now

metachris · 2022-08-24T10:12:01Z

Linking the current design doc by @ralexstokes: https://hackmd.io/@ralexstokes/SynPJN_pq

sambacha · 2022-08-28T00:16:34Z

For the SecureRpc Relay we are operating, here are some metrics/KPIs that are collected and some diagrams (out of date, but should be slightly helpful)

Metrics collected: https://gist.github.com/sambacha/d613f8be00caa50befe0c7a8e1dda073
Grafana dashboard screenshot: screencapture-grafana-manifoldx-d-RixFH2jnz-relay-overview-copy-2022-08-26-13_02_03 (Grafana v9.1 allows public dashboards, so we should be able to make this publicly queryable soon)

kailinr · 2022-09-06T19:45:52Z

Linking Alex's relay monitor implementation

MoeMahhouk · 2023-09-07T11:13:49Z

Please correct me if I'm wrong, but wouldn't the aforementioned issues be tackled by applying a BFT system for the proposer <-> relay interaction?
I.e., since the error scenarios stated here show that a relay can behave in a byzantine manner, maybe a BFT system where a proposer can send the request to (3f+1) relay nodes and, based on a consensus, retrieve the correct answer and avoid the issues stated above.
BFT would ensure strong consistency regarding the latest payloads and correct values, compared to the eventual consistency of a gossip network consensus model. Furthermore, the trust model of a BFT system is stricter than that of a gossip network.
On the other hand, the communication overhead of a gossip network is lower than that of BFT.

In both cases, the caveat is that it would possibly increase the complexity and degrade the performance as a tradeoff.
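To make the (3f+1) idea concrete, here is a minimal sketch of the quorum arithmetic only, assuming all n relays answer the same query with comparable responses (which is itself an assumption about how relays would operate). The function names are hypothetical. The quorum used is the general byzantine quorum size, which equals 2f+1 when n = 3f+1.

```python
from collections import Counter
from typing import Optional, Sequence


def max_faulty(n: int) -> int:
    """Largest f such that n >= 3f + 1."""
    return max((n - 1) // 3, 0)


def agreed_answer(responses: Sequence[Optional[str]]) -> Optional[str]:
    """Return the answer backed by a byzantine quorum of the n relays, else None.

    `responses` has one entry per relay; None means the relay did not answer.
    """
    n = len(responses)
    f = max_faulty(n)
    # Any two quorums of this size overlap in more than f relays,
    # so two different answers can never both reach a quorum.
    quorum = (n + f) // 2 + 1
    counts = Counter(r for r in responses if r is not None)
    if not counts:
        return None
    answer, votes = counts.most_common(1)[0]
    return answer if votes >= quorum else None
```

With 4 relays (f = 1) an answer needs 3 matching responses, so one withholding or lying relay cannot block or change the result, but two can stop the proposer from reaching a quorum.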

PS. please excuse my limited knowledge in BFT/Gossip systems in case what I mentioned above is incorrect.

ralexstokes · 2023-09-08T14:57:39Z

this is an interesting avenue for exploration; however, the prevailing model is that relays are not guaranteed to share any of their bids/data so having a BFT style approach across disparate actors doesn't really make sense...

there has been a thread we have been dancing around on the mev-boost community calls for some time that points towards a different model where independent entities do just run some kind of "relay" node and then builders are expected to publish to all of them, e.g. over some kind of gossip net -- and in this case we could imagine some kind of consensus over the "bid pool" that reduces room for byzantine behavior

that being said, the "live" pathways of the relay are incredibly latency sensitive so unless the consensus process really brought substantial benefits I think it would be hard to get adoption

and I think we'd also need to move towards more of an optimistic regime, see something like v3 here: https://github.com/michaelneuder/optimistic-relay-documentation/blob/main/towards-epbs.md

metachris added the brainstorming Currently in discussion label Jun 25, 2022

come-maiz mentioned this issue Jul 8, 2022

Ethereum Core Devs Meeting 142 Agenda ethereum/pm#562

Closed

come-maiz mentioned this issue Jul 21, 2022

Safeguards and mitigations to preserve liveness #222

Open

kailinr added the relay label Aug 1, 2022

come-maiz added this to the safer-merge milestone Aug 2, 2022
