This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

validator discovery improvements #2460

Closed
ordian opened this issue Feb 17, 2021 · 19 comments
Labels
J0-enhancement An additional feature request.
Milestone

Comments

@ordian
Member

ordian commented Feb 17, 2021

As discussed in Element, we have several things that can be improved with the current validator discovery implementation:

  1. [minor] connect_to_validators accepts a Vec<ValidatorId>; we could accept a Vec<ValidatorIndex> instead. This would avoid a couple of runtime calls and simplify the implementation a bit.
  2. [moderate] authority discovery worker in substrate only holds keys for the current and the next session, meaning that historical validator queries will silently fail. We'll need to fix authority-discovery: Keep the validator ID <-> PeerId mapping of past sessions substrate#6910.
  3. [moderate] if the authority discovery worker returns None here, we'll only attempt to add an AuthorityDiscoveryId to a peerset when it's requested again in a ConnectToValidators request. Not sure what we can do about it. Note that the substrate side will only update its cache every 10 minutes.
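
The difference proposed in point 1 can be sketched as follows. This is a hypothetical, simplified illustration, not the actual polkadot API: ValidatorId and ValidatorIndex here are stand-in types, and the real connect_to_validators does considerably more work.

```rust
// Stand-in types; the real ones live in polkadot-primitives.
#[derive(Clone, Debug, PartialEq)]
struct ValidatorId(String);

#[derive(Clone, Copy, Debug, PartialEq)]
struct ValidatorIndex(u32);

// Current shape: callers pass ids, so the implementation must recover the
// indices (here via a linear search standing in for a runtime call) before
// it can resolve discovery keys.
fn connect_to_validators_by_id(
    session_validators: &[ValidatorId],
    ids: &[ValidatorId],
) -> Vec<ValidatorIndex> {
    ids.iter()
        .filter_map(|id| {
            session_validators
                .iter()
                .position(|v| v == id)
                .map(|i| ValidatorIndex(i as u32))
        })
        .collect()
}

// Proposed shape: callers pass indices directly, skipping the lookup.
fn connect_to_validators_by_index(indices: &[ValidatorIndex]) -> Vec<ValidatorIndex> {
    indices.to_vec()
}
```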
@ordian ordian added the J0-enhancement An additional feature request. label Feb 17, 2021
@ordian ordian added this to the Tenleytown milestone Feb 17, 2021
@ordian
Member Author

ordian commented Feb 17, 2021

For p. 2, the problem is that the authority discovery worker only retains the keys returned by RuntimeApi::authorities(), which currently includes authorities of the current and the next session.

As suggested by @rphmeier, we can tweak the semantics of RuntimeApi::authorities() to include authorities not only of the current and the next session, but also of some past sessions, using the session_info module. This API is implemented here for polkadot:

AuthorityDiscovery::authorities()
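
The suggested semantics can be sketched as below: merge the current and next session's authorities with those of a few retained past sessions, deduplicating. This is a hedged illustration only; AuthorityDiscoveryId is a stand-in type and the real implementation reads past sessions from the session_info module in the runtime.

```rust
use std::collections::HashSet;

// Stand-in for the real sr25519-based AuthorityDiscoveryId.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
struct AuthorityDiscoveryId(u64);

// Return current + next session authorities, extended with those of some
// past sessions, without duplicates and preserving first-seen order.
fn authorities(
    current_and_next: Vec<AuthorityDiscoveryId>,
    past_sessions: &[Vec<AuthorityDiscoveryId>],
) -> Vec<AuthorityDiscoveryId> {
    let mut seen = HashSet::new();
    current_and_next
        .into_iter()
        .chain(past_sessions.iter().flatten().cloned())
        .filter(|id| seen.insert(id.clone()))
        .collect()
}
```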

@parity2305
Contributor

@ordian I see that the session_info module is part of the parachains runtime. Is it okay to reference it (and initialize it) within the relay chain's runtime?

@ordian
Member Author

ordian commented Feb 22, 2021

@parity2305 thanks for looking into this. We'll have to enable it only on rococo for now,
so it will go here:

AuthorityDiscovery::authorities()

We'll probably need a helper function in runtime/parachains/src/runtime_api_impl/v1.rs.

@parity2305
Contributor

parity2305 commented Feb 24, 2021

@ordian So for point 1:
I do not see any real advantage of using ValidatorIndex instead of ValidatorId here.

There are two usages of connect_to_validators:

  1. match validator_discovery::connect_to_validators(
    Straightforward; we can save one runtime call.
  2. let request = validator_discovery::connect_to_validators(
    ValidatorId is used in multiple places in the parent functions here; if we remove the runtime call that retrieves the ValidatorIds, we will also have to change the ValidatorGroup data structure, which is a non-minor change.

@ordian
Member Author

ordian commented Feb 24, 2021

Thanks for taking a look.
Our initial thinking was that it would simplify the implementation. But, indeed, it looks like the collator side would not benefit from this change, as it actually uses the output of validator discovery ((PeerId, ValidatorId)). I don't know if it would be worth adding another API just for the sake of a minor simplification, cc @eskimor. Probably not.

@parity2305
Contributor

> Thanks for taking a look.
> Our initial thinking was it would simplify the implementation. But, indeed, looks like the collator side would not benefit from this change as it actually uses the output of validator discovery ((PeerId, ValidatorId)). I don't know if would be worth adding another API just for the sake of minor simplification cc @eskimor. Probably not.

Correct. Plus, inserting a key-value pair into our_validators_groups will anyway require the current Vec<ValidatorId>. So, if we change the impl of connect_to_validators, we will need to change the impl of the ValidatorGroup struct as well.

@eskimor
Member

eskimor commented Feb 24, 2021

I just looked at the implementation and noticed that all connect_to_validators is doing is filtering validators by the given id to get back the index, in order to retrieve the AuthorityDiscoveryId matching that index. So the question arises: why not pass in AuthorityDiscoveryId or the ValidatorIndex directly?

ValidatorIndex is probably the way to go when talking about "the validator": all information about a validator can easily be retrieved once you have a ValidatorIndex. Going from ValidatorId to AuthorityDiscoveryId is a bit clumsy.

Ideally, in my opinion, the API would either take AuthorityDiscoveryId, because that is what it needs and also what the caller expects, as it is asking for a network connection, or we settle on ValidatorIndex as the general way of specifying a particular validator, from which everyone can easily look up the information they need.

All in all, not a big deal of course. In my particular case I already had a ValidatorIndex ready and found it suboptimal that I would have had to go into the runtime to fetch the id, just for the other side to clumsily restore the ValidatorIndex I already had. It is certainly not worth a costly refactor, but ideally it would be considered when doing a refactor anyway.
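
The round trip described above can be illustrated with a minimal sketch. The function name and the plain-string ids are hypothetical; in reality both validator and discovery keys come from the runtime per session.

```rust
// Illustrative sketch: the caller started with a ValidatorIndex, converted it
// to a ValidatorId via the runtime, and connect_to_validators then searches
// for that id only to recover the very index it needs to look up the
// AuthorityDiscoveryId. Types are simplified stand-ins (&str for keys).
fn id_to_discovery_id(
    validators: &[&str],    // ValidatorId per index, as the runtime returns them
    discovery_ids: &[&str], // AuthorityDiscoveryId per index
    requested: &str,        // the ValidatorId the caller passed in
) -> Option<String> {
    // The filter step: recover the index from the id...
    let index = validators.iter().position(|v| *v == requested)?;
    // ...just to index into the discovery keys.
    discovery_ids.get(index).map(|d| d.to_string())
}
```

Passing the index (or the discovery id) directly would make the search step unnecessary.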

@parity2305
Contributor

parity2305 commented Feb 25, 2021

@ordian For point 3, we can store ValidatorIds that the authority discovery wasn't able to resolve inside a priority queue, where priority is proportional to how many times a particular ValidatorId has been requested for connection. And have a light job running every x minutes (where x >= 10) that tries to resolve the ValidatorIds and, if successful, removes them from the queue and adds the corresponding peer id(s) to the reserved set.
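
A minimal sketch of that bookkeeping, with hypothetical names (UnresolvedQueue, note_unresolved) and plain strings standing in for ValidatorIds; the periodic job and the actual discovery call are out of scope here:

```rust
use std::collections::{BinaryHeap, HashMap};

// Tracks ids that discovery failed to resolve, weighted by request count.
#[derive(Default)]
struct UnresolvedQueue {
    request_counts: HashMap<String, u32>, // ValidatorId -> times requested
}

impl UnresolvedQueue {
    // Called whenever a ConnectToValidators request fails to resolve this id.
    fn note_unresolved(&mut self, id: &str) {
        *self.request_counts.entry(id.to_string()).or_insert(0) += 1;
    }

    // Order for one retry pass: most-requested ids first.
    fn retry_order(&self) -> Vec<String> {
        let mut heap: BinaryHeap<(u32, String)> = self
            .request_counts
            .iter()
            .map(|(id, n)| (*n, id.clone()))
            .collect();
        let mut out = Vec::new();
        while let Some((_, id)) = heap.pop() {
            out.push(id);
        }
        out
    }

    // Called once discovery succeeds and the peer joined the reserved set.
    fn mark_resolved(&mut self, id: &str) {
        self.request_counts.remove(id);
    }
}
```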

@ordian
Member Author

ordian commented Feb 25, 2021

@parity2305 yeah, I was thinking of something similar. We either need a background task polling the authority discovery service every few minutes, or a way to subscribe to authority discovery notifications when a new id is discovered (that would be harder to implement, but more efficient). For the background task case, we'd need to poll it for every unresolved authority id I guess, so the priority here doesn't really matter; it could simply be a round-robin queue.
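
The simpler round-robin alternative mentioned above can be sketched in a few lines; the function name is hypothetical and `resolve` stands in for a real query to the authority discovery service:

```rust
use std::collections::VecDeque;

// One polling pass over the unresolved ids: every id gets exactly one retry;
// ids that still fail to resolve go to the back of the queue for the next pass.
fn round_robin_retry(queue: &mut VecDeque<String>, resolve: impl Fn(&str) -> bool) {
    for _ in 0..queue.len() {
        if let Some(id) = queue.pop_front() {
            if !resolve(&id) {
                queue.push_back(id);
            }
        }
    }
}
```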

@parity2305
Contributor

@ordian Cool. I have started working on adding a background job to validator_discovery using ctx.spawn. Will open a draft PR soon.

@eskimor
Member

eskimor commented Feb 26, 2021

> 3. [moderate] if the authority discovery worker returns None [here](https://github.com/paritytech/polkadot/blob/490cbd72fc341973a34acb56a73da50dcd07ee0f/node/network/bridge/src/validator_discovery.rs#L238), we'll only attempt to add an `AuthorityDiscoveryId` to a peerset when it's requested again in a `ConnectToValidators` request. Not sure what we can do about. Note that the substrate side will only update its cache in [10 minutes](https://github.com/paritytech/substrate/blob/13b699868330fe60c272eeda1c9b326b0e222173/client/authority-discovery/src/lib.rs#L58).

Actually, as discovery happens for the current and the next session, shouldn't this be fine? If discovery was not able to discover the peer for a complete session, it probably won't be able to a few minutes later either.

@ordian
Member Author

ordian commented Mar 1, 2021

> > 3. [moderate] if the authority discovery worker returns None [here](https://github.com/paritytech/polkadot/blob/490cbd72fc341973a34acb56a73da50dcd07ee0f/node/network/bridge/src/validator_discovery.rs#L238), we'll only attempt to add an `AuthorityDiscoveryId` to a peerset when it's requested again in a `ConnectToValidators` request. Not sure what we can do about. Note that the substrate side will only update its cache in [10 minutes](https://github.com/paritytech/substrate/blob/13b699868330fe60c272eeda1c9b326b0e222173/client/authority-discovery/src/lib.rs#L58).
>
> Actually, as the discovery happens for the current and the next session - shouldn't this be fine? Like if discovery was not able to discover the peer for a complete session, it probably won't a few minutes later either.

> as the discovery happens for the current and the next session - shouldn't this be fine

this is just a filter that is used to retain ids in the cache

> Like if discovery was not able to discover the peer for a complete session

Discovery happens continuously by listening to DHT events. If we happen to query it at some point and it returns None here, we won't query it again until the same validator is requested again. However, a discovery might happen after this event; or are you saying it can't?

@ordian
Member Author

ordian commented Mar 1, 2021

Also, the default discovery configuration parameters are not set in stone, and we might want to tweak them for polkadot's needs.

@eskimor
Member

eskimor commented Mar 1, 2021

It retains still-active authorities in the cache and also issues queries for them, which happens continuously for validators of the current and the next session. So the cache will be filled with authorities from the next session throughout the current session; I would expect the cache to be nicely populated by the time we start requesting authorities on session change, as it has already been filled during the previous session (when we did not care about them yet). This is true except maybe on startup; not sure if that is an issue though, as a node will have some ramp-up time anyway?

To me it looks like the current implementation is totally fine. The cache represents our view of discoverable validators; sure, that view can change over time, but I believe we can treat any particular query as the current status quo and be good with it.

@eskimor
Member

eskimor commented Mar 1, 2021

> Discovery is happening continuously by listening to the DHT events. If we happen to query it at some point at it returns None here, we won't query it again until the same validator is requested again. However, a discovery might happen after this event or are you saying it can't?

It could, but I would argue that it is pretty unlikely and we should not need to care. The cache should make sure we don't need to, and I think it does. There can be network issues of course, but we could also have them when connecting to the validator; that's just something we need to be able to deal with, and I think we do, as we have redundancy in all our protocols.

From an attack surface perspective: If an adversary was able to prevent discovery for a whole session, it could very likely prevent it for another session as well. So re-doing failed lookups would not help here either.

@ordian
Member Author

ordian commented Mar 1, 2021

Alright, so our assumptions are:

  1. We don't issue connection requests to the validators of the next session.
  2. Authority discovery fills up the cache for the next session before it begins. Except on startup (we also care about a few past sessions), but we allow some time for bootstrap.

If these assumptions are to hold in the future, then I agree we don't need an additional polling mechanism.

@eskimor
Member

eskimor commented Mar 1, 2021

I guess 1 can be weakened a bit: if we only start requesting connections to validators of the next session towards the end of the previous one (to make transitions smoother), we should still be golden.

@parity2305
Contributor

parity2305 commented Mar 1, 2021

Hmm. Agreed. Also, with the merge of #2494 we are now keeping track of the past N sessions in authority discovery, so as per #1461 (comment) every networking subsystem should be able to connect to relevant validators using the authority-discovery worker's cache.

I will close my draft PR.

@ordian
Member Author

ordian commented Mar 1, 2021

Thanks. P. 2 was implemented in #2494, and we agreed that p. 1 and p. 3 are not worth pursuing, so I'm closing this issue.

@ordian ordian closed this as completed Mar 1, 2021

3 participants