Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ResourceUnavailable details #2414

Closed
djrtwo opened this issue May 14, 2021 · 5 comments
Closed

ResourceUnavailable details #2414

djrtwo opened this issue May 14, 2021 · 5 comments
Labels

Comments

@djrtwo
Copy link
Contributor

djrtwo commented May 14, 2021

Based on conversations with @arnetheduck, this error code is a bit ambiguous/under-specified for all of the req/resp message types. Specifically, we need to ensure that the MUSTs and SHOULDs of various calls are sufficient and safe given this new code.

This error code was pulled for the v1.1.0-alpha.4 release (#2413). The plan is to discuss the ambiguities here, and work through these details BY WEDNESDAY MAY 19 in preparation for spec freeze on Friday.

The current path is to include the error code but to ensure it is done safely.

@arnetheduck
Copy link
Contributor

To avoid wasted implementation effort on how the error should be handled and interpreted by all clients, there are a few details to work out - not only do we save work now ("oh, client x responds this way and client y the other"), but detailing these behaviors in the spec will also make future changes more smooth as having a fixed point / behavior to base backwards compatibility on generally is easier.

In particular, for now we've divided the world into "worked" and "didn't work", letting the client side implement logic to guard against faulty and/or malicious responders (which it must do anyway), with "didn't work" being treated as semi-permanent.

More specific error codes are generally useful when there is an expectation that clients will choose a behavior based on that error code - when adding error codes to the spec, we should thus specify under which conditions each request responds with a particular error code, what the expected behavior on the receiving end is, and how to deal with unexpected error codes.

For ResourceUnavailable in particular:

BeaconBlocksByRoot answers queries for missing blocks by simply omitting them - clients reading the response to the request must compute a hash of the the block to validate it so they can always deduce ResourceUnavailable - introducing the error code means we should pick one way to answer here (by enumerating which error codes are valid in the context)

BeaconBlocksByRange - here, ResourceUnavailable is a bit more tricky - to add it to the spec, one would have to define the cases where it should be returned. The spec says:

Clients MUST keep a record of signed blocks seen on the epoch range [max(GENESIS_EPOCH, current_epoch - MIN_EPOCHS_FOR_BLOCK_REQUESTS), current_epoch]

There are therefore two cases where ResourceUnavailable could be used:

  • The client is not following the spec and doesn't make the blocks in the given range available. This can generally be deduced just like with all other block requests: if another client gives a better response, it's known that the client that didn't give blocks doesn't make the resource available - using an error code here basically allows the faulty peer to be kicked earlier because it now admits it's not serving the blocks. This may or may not be a temporary condition (backfilling for example).
  • The request was for a range outside of the required block storage range - this may happen naturally for valid requests for example when time passes between request and response. The "current" way to handle this would be to simply start sending blocks from within the range, and let the receiving client deal with the mess. It's an open question how to handle "partially" missing ranges.

The other way to look at the error in the case of BeaconBlocksByRange is temporary vs permanent - the former happens during backfilling for example where the more specific error code can be used to signal the intent to start complying with the spec later - one open issue with any temporary interpretation is that it increases the incentive to always respond this way: when a peer is syncing, it is not useful to the server so if the syncing peer kicks them temporarily and reconnects when synced, that saves bandwidth for the server. If/when this becomes a problem, we might need to balance it with a policing effort.

One final open question is what to do with light-syncing peers that are not validating, but just listening to some traffic - presumably, they would always respond to requests with ResourceUnavailable - but at the same time, they would likely not be participating in gossip at all. Which peers to allow onto the gossip network is an open research question - gossipsub provides some insight into the quality of gossip peers, but there's little in the spec for guidance.

@ajsutton
Copy link
Contributor

BeaconBlocksByRoot answers queries for missing blocks by simply omitting them - clients reading the response to the request must compute a hash of the the block to validate it so they can always deduce ResourceUnavailable - introducing the error code means we should pick one way to answer here (by enumerating which error codes are valid in the context)

Yep I don't think there's any need for ResourceUnavailable with BeaconBlocksByRoot as not returning any blocks already means "I don't have any of those blocks".

For BeaconBlocksByRange the error code is useful because it distinguishes between "there were no blocks in this range" and "I don't have records for this range so can't answer the query". That would apply both if the blocks were still backfilling or if they'd been pruned because they are outside the block storage range. Since BeaconBlocksByRange requires returning the first block in the range, I'd say that a node MUST return ResourceUnavailable if they are unable to determine and return the first block in the range. In most cases that means if the start of the range is prior to the earliest slot blocks are kept for but may be more complex if the node can have arbitrary ranges of blocks that are unavailable.

Nodes receiving ResourceUnavailable response would immediately know that this peer is not providing an accurate view of the chain (vs getting an empty response which then requires comparing against multiple other peers to determine if it really was empty or the node just didn't return blocks). How the requesting node responds to that really comes down to their approach to sync and peer management policies same as different clients respond differently to getting empty responses today.

I guess I'm unclear on exactly what problems we're trying to work through here because I wasn't privy to the original conversation.

@arnetheduck
Copy link
Contributor

I don't think there's any need

That should be part of any PR that introduces ResourceUnavailable, so as to avoid ambiguities.

For BeaconBlocksByRange the error code is useful because it distinguishes between "there were no blocks in this range" and "I don't have records for this range so can't answer the query".

really was empty

this is the tricky part: any range you ask for might be empty in their view of the chain - the byrange request asks for what the other party considers to be the canonical chain - you have to reconcile that with your local information about what the chain is and what others are sending you, regardless of what they send you (error codes, blocks, empty ranges) - any "the chain is empty" response is no more authoritative than "I don't know, I don't have the block range" in this regard - for example, it doesn't tell you if you should request a different range from the same peer or not (they might have it, or they might not), and it doesn't tell you if you should request the same range from a different peer (it might have blocks or it might not).

For the error code to be useful, one would need to accompany it with a guide as to when to return it exactly, and preferably, how the receiving client should react it - if this is not included, we'll end up in a situation where the error code becomes a client-specific extension that will be difficult to code logic against.

One property of this kind of error code could be used for is to establish who is a freeloader - the network can (and perhaps should) by default support a certain amount of freeloading - the longer term solution discussed here is that we would introduce such information in the ENR so that it's reasonably easy to find appropriate nodes - it might however also make more sense that such nodes stop the request at a libp2p level in the multiplexer.

I'm unclear on exactly what problems we're trying to work through here

It's all in my comment: introducing an error code should be accompanied by the exact conditions under which it's triggered, what the expected behavior on the receiving side is and for which requests it's valid, as part of the spec text. For client-specific or loosely defined errors, there's always the anything-goes >=128 range.

@ajsutton
Copy link
Contributor

this is the tricky part: any range you ask for might be empty in their view of the chain - the byrange request asks for what the other party considers to be the canonical chain - you have to reconcile that with your local information about what the chain is and what others are sending you, regardless of what they send you (error codes, blocks, empty ranges) - any "the chain is empty" response is no more authoritative than "I don't know, I don't have the block range" in this regard - for example, it doesn't tell you if you should request a different range from the same peer or not (they might have it, or they might not), and it doesn't tell you if you should request the same range from a different peer (it might have blocks or it might not).

I think this is our fundamental disagreement. When designing APIs it's really, really useful to have more information than less. Even if you don't trust that result, knowing exactly what's claimed is a big improvement on a vague empty response which could mean no blocks in the range, could mean the blocks aren't available or could be the node being a jerk.

It's ok if some clients treat the error response the same as an empty response, just as it's ok that different clients implement different approaches to syncing. The reason this whole proposal started months ago is because it really would be useful to differentiate between a node claiming there were no blocks in the range and one saying they can't answer that request. If the remote peer explicitly says they don't have the blocks, the requesting peer can disconnect them immediately, knowing that they don't have the blocks required and aren't currently useful. They don't have to disconnect them, there's lots of different things clients might factor into making that decision and it's ok for clients to have different policies on that, just as we have different policies on how to prioritise which peers to connect to so we have coverage of required subnets.

For the error code to be useful, one would need to accompany it with a guide as to when to return it exactly, and preferably, how the receiving client should react it - if this is not included, we'll end up in a situation where the error code becomes a client-specific extension that will be difficult to code logic against.

Agree we should clearly define when to return it. As I mentioned that will be something close to "when the node does not have sufficient information to return the first block in the requested range".

We don't actually define what a client should do with a successful response so it seems odd to have a higher bar for error codes. We also don't define how clients act when they get a InvalidRequest or ServerError response. It would be fine to say that a client MAY disconnect when receiving this response, you could explicitly say they MAY retry it later but I feel like that's pretty redundant since we don't ever say the same request can't be sent multiple times. We can't define how this response should affect the node's syncing algorithm because we're defining a request/response API, not a syncing algorithm.

And yes, given the structure of the spec we should explicitly say this response is only valid for BeaconBlocksByRange.

@djrtwo
Copy link
Contributor Author

djrtwo commented May 21, 2021

PR up at #2430

@djrtwo djrtwo closed this as completed Jun 7, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants