This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

validator discovery improvements #2460

Closed
ordian opened this issue Feb 17, 2021 · 19 comments
Labels
J0-enhancement An additional feature request.
Milestone

Comments

@ordian
Member

ordian commented Feb 17, 2021

As discussed in Element, we have several things that can be improved with the current validator discovery implementation:

  1. [minor] connect_to_validators accepts a Vec<ValidatorId>; we could accept a Vec<ValidatorIndex> instead. This would avoid a couple of runtime calls and simplify the implementation a bit.
  2. [moderate] authority discovery worker in substrate only holds keys for the current and the next session, meaning that historical validator queries will silently fail. We'll need to fix authority-discovery: Keep the validator ID <-> PeerId mapping of past sessions substrate#6910.
  3. [moderate] if the authority discovery worker returns None here, we'll only attempt to add an AuthorityDiscoveryId to a peerset when it's requested again in a ConnectToValidators request. Not sure what we can do about it. Note that the substrate side will only update its cache every 10 minutes.
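
The difference proposed in point 1 can be sketched as follows. This is a hypothetical, simplified illustration, not the actual polkadot API: ValidatorId and ValidatorIndex here are stand-in types, and the real connect_to_validators does considerably more work.

```rust
// Stand-in types; the real ones live in polkadot-primitives.
#[derive(Clone, Debug, PartialEq)]
struct ValidatorId(String);

#[derive(Clone, Copy, Debug, PartialEq)]
struct ValidatorIndex(u32);

// Current shape: callers pass ids, so the implementation must recover the
// indices (here via a linear search standing in for a runtime call) before
// it can resolve discovery keys.
fn connect_to_validators_by_id(
    session_validators: &[ValidatorId],
    ids: &[ValidatorId],
) -> Vec<ValidatorIndex> {
    ids.iter()
        .filter_map(|id| {
            session_validators
                .iter()
                .position(|v| v == id)
                .map(|i| ValidatorIndex(i as u32))
        })
        .collect()
}

// Proposed shape: callers pass indices directly, skipping the lookup.
fn connect_to_validators_by_index(indices: &[ValidatorIndex]) -> Vec<ValidatorIndex> {
    indices.to_vec()
}
```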
@ordian ordian added the J0-enhancement An additional feature request. label Feb 17, 2021
@ordian ordian added this to the Tenleytown milestone Feb 17, 2021
@ordian
Member Author

ordian commented Feb 17, 2021

For p. 2, the problem is that the authority discovery worker only retains the keys returned by RuntimeApi::authorities(), which currently includes authorities of the current and the next session.

As suggested by @rphmeier, we can tweak the semantics of RuntimeApi::authorities() to include authorities not only of the current and the next session, but also of some past sessions, using the session_info module. This API is implemented here for polkadot:

AuthorityDiscovery::authorities()
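
The suggested semantics can be sketched as below: merge the current and next session's authorities with those of a few retained past sessions, deduplicating. This is a hedged illustration only; AuthorityDiscoveryId is a stand-in type and the real implementation reads past sessions from the session_info module in the runtime.

```rust
use std::collections::HashSet;

// Stand-in for the real sr25519-based AuthorityDiscoveryId.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
struct AuthorityDiscoveryId(u64);

// Return current + next session authorities, extended with those of some
// past sessions, without duplicates and preserving first-seen order.
fn authorities(
    current_and_next: Vec<AuthorityDiscoveryId>,
    past_sessions: &[Vec<AuthorityDiscoveryId>],
) -> Vec<AuthorityDiscoveryId> {
    let mut seen = HashSet::new();
    current_and_next
        .into_iter()
        .chain(past_sessions.iter().flatten().cloned())
        .filter(|id| seen.insert(id.clone()))
        .collect()
}
```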

@parity2305
Contributor

@ordian I see that the session_info module is part of the parachains runtime. Is it okay to reference it (and initialize it) within the relay chain's runtime?

@ordian
Member Author

ordian commented Feb 22, 2021

@parity2305 thanks for looking into this. We'll have to enable it only on rococo for now,
so it will go here:

AuthorityDiscovery::authorities()

We'll probably need a helper function in runtime/parachains/src/runtime_api_impl/v1.rs.

@parity2305
Contributor

parity2305 commented Feb 24, 2021

@ordian So for point 1:
I do not see any real advantage of using ValidatorIndex instead of ValidatorId here.

There are two usages of connect_to_validators:

  1. match validator_discovery::connect_to_validators(
    Straightforward; we can save one runtime call.
  2. let request = validator_discovery::connect_to_validators(
    ValidatorId is used in multiple places in the parent functions here; if we remove the runtime call that retrieves the ValidatorIds, we will also have to change the ValidatorGroup data structure, which is a non-minor change.

@ordian
Member Author

ordian commented Feb 24, 2021

Thanks for taking a look.
Our initial thinking was that it would simplify the implementation. But, indeed, it looks like the collator side would not benefit from this change, as it actually uses the output of validator discovery ((PeerId, ValidatorId)). I don't know if it would be worth adding another API just for the sake of a minor simplification, cc @eskimor. Probably not.

@parity2305
Contributor

> Thanks for taking a look.
> Our initial thinking was it would simplify the implementation. But, indeed, looks like the collator side would not benefit from this change as it actually uses the output of validator discovery ((PeerId, ValidatorId)). I don't know if would be worth adding another API just for the sake of minor simplification cc @eskimor. Probably not.

Correct. Plus, inserting a key-value pair into our_validators_groups will anyway require the current Vec<ValidatorId>. So, if we change the impl of connect_to_validators, we will need to change the impl of the ValidatorGroup struct as well.

@eskimor
Member

eskimor commented Feb 24, 2021

I just looked at the implementation and noticed that all connect_to_validators is doing is filtering validators by the given id to get back the index, in order to retrieve the AuthorityDiscoveryId matching that index. So the question arises: why not pass in AuthorityDiscoveryId or the ValidatorIndex directly?

ValidatorIndex is probably the way to go when talking about "the validator": all information about a validator can easily be retrieved once you have a ValidatorIndex. Going from ValidatorId to AuthorityDiscoveryId is a bit clumsy.

Ideally, in my opinion, the API would either take AuthorityDiscoveryId, because that is what it needs and also what the caller expects, as it is asking for a network connection, or we settle on ValidatorIndex as the general way of specifying a particular validator, from which everyone can easily look up the information they need.

All in all, not a big deal of course. In my particular case I already had a ValidatorIndex ready and found it suboptimal that I would have had to go into the runtime to fetch the id, just for the other side to clumsily restore the ValidatorIndex I already had. It is certainly not worth a costly refactor, but ideally it would be considered when doing a refactor anyway.
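
The round trip described above can be illustrated with a minimal sketch. The function name and the plain-string ids are hypothetical; in reality both validator and discovery keys come from the runtime per session.

```rust
// Illustrative sketch: the caller started with a ValidatorIndex, converted it
// to a ValidatorId via the runtime, and connect_to_validators then searches
// for that id only to recover the very index it needs to look up the
// AuthorityDiscoveryId. Types are simplified stand-ins (&str for keys).
fn id_to_discovery_id(
    validators: &[&str],    // ValidatorId per index, as the runtime returns them
    discovery_ids: &[&str], // AuthorityDiscoveryId per index
    requested: &str,        // the ValidatorId the caller passed in
) -> Option<String> {
    // The filter step: recover the index from the id...
    let index = validators.iter().position(|v| *v == requested)?;
    // ...just to index into the discovery keys.
    discovery_ids.get(index).map(|d| d.to_string())
}
```

Passing the index (or the discovery id) directly would make the search step unnecessary.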

@parity2305
Contributor

parity2305 commented Feb 25, 2021

@ordian For point 3, we can store ValidatorIds that the authority discovery wasn't able to resolve inside a priority queue, where priority is proportional to how many times a particular ValidatorId has been requested for connection. And have a light job running every x minutes (where x >= 10) that tries to resolve the ValidatorIds and, if successful, removes them from the queue and adds the corresponding peer id(s) to the reserved set.
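
A minimal sketch of that bookkeeping, with hypothetical names (UnresolvedQueue, note_unresolved) and plain strings standing in for ValidatorIds; the periodic job and the actual discovery call are out of scope here:

```rust
use std::collections::{BinaryHeap, HashMap};

// Tracks ids that discovery failed to resolve, weighted by request count.
#[derive(Default)]
struct UnresolvedQueue {
    request_counts: HashMap<String, u32>, // ValidatorId -> times requested
}

impl UnresolvedQueue {
    // Called whenever a ConnectToValidators request fails to resolve this id.
    fn note_unresolved(&mut self, id: &str) {
        *self.request_counts.entry(id.to_string()).or_insert(0) += 1;
    }

    // Order for one retry pass: most-requested ids first.
    fn retry_order(&self) -> Vec<String> {
        let mut heap: BinaryHeap<(u32, String)> = self
            .request_counts
            .iter()
            .map(|(id, n)| (*n, id.clone()))
            .collect();
        let mut out = Vec::new();
        while let Some((_, id)) = heap.pop() {
            out.push(id);
        }
        out
    }

    // Called once discovery succeeds and the peer joined the reserved set.
    fn mark_resolved(&mut self, id: &str) {
        self.request_counts.remove(id);
    }
}
```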

@ordian
Member Author

ordian commented Feb 25, 2021

@parity2305 yeah, I was thinking of something similar. We either need a background task polling the authority discovery service every few minutes, or a way to subscribe to authority discovery notifications when a new id is discovered (that would be harder to implement, but more efficient). For the background task case, we'd need to poll it for every unresolved authority id I guess, so the priority here doesn't really matter; it could simply be a round-robin queue.
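
The simpler round-robin alternative mentioned above can be sketched in a few lines; the function name is hypothetical and `resolve` stands in for a real query to the authority discovery service:

```rust
use std::collections::VecDeque;

// One polling pass over the unresolved ids: every id gets exactly one retry;
// ids that still fail to resolve go to the back of the queue for the next pass.
fn round_robin_retry(queue: &mut VecDeque<String>, resolve: impl Fn(&str) -> bool) {
    for _ in 0..queue.len() {
        if let Some(id) = queue.pop_front() {
            if !resolve(&id) {
                queue.push_back(id);
            }
        }
    }
}
```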

@parity2305
Contributor

@ordian Cool. I have started working on adding a background job to validator_discovery using ctx.spawn. Will open a draft PR soon.

@eskimor
Member

eskimor commented Feb 26, 2021

> 3. [moderate] if the authority discovery worker returns None [here](https://github.com/paritytech/polkadot/blob/490cbd72fc341973a34acb56a73da50dcd07ee0f/node/network/bridge/src/validator_discovery.rs#L238), we'll only attempt to add an `AuthorityDiscoveryId` to a peerset when it's requested again in a `ConnectToValidators` request. Not sure what we can do about. Note that the substrate side will only update its cache in [10 minutes](https://github.com/paritytech/substrate/blob/13b699868330fe60c272eeda1c9b326b0e222173/client/authority-discovery/src/lib.rs#L58).

Actually, as discovery happens for the current and the next session, shouldn't this be fine? If discovery was not able to discover the peer for a complete session, it probably won't be able to a few minutes later either.

@ordian
Member Author

ordian commented Mar 1, 2021

> > 3. [moderate] if the authority discovery worker returns None [here](https://github.com/paritytech/polkadot/blob/490cbd72fc341973a34acb56a73da50dcd07ee0f/node/network/bridge/src/validator_discovery.rs#L238), we'll only attempt to add an `AuthorityDiscoveryId` to a peerset when it's requested again in a `ConnectToValidators` request. Not sure what we can do about. Note that the substrate side will only update its cache in [10 minutes](https://github.com/paritytech/substrate/blob/13b699868330fe60c272eeda1c9b326b0e222173/client/authority-discovery/src/lib.rs#L58).
>
> Actually, as the discovery happens for the current and the next session - shouldn't this be fine? Like if discovery was not able to discover the peer for a complete session, it probably won't a few minutes later either.

> as the discovery happens for the current and the next session - shouldn't this be fine

this is just a filter that is used to retain ids in the cache

> Like if discovery was not able to discover the peer for a complete session

Discovery happens continuously by listening to DHT events. If we happen to query it at some point and it returns None here, we won't query it again until the same validator is requested again. However, a discovery might happen after this event; or are you saying it can't?

@ordian
Member Author

ordian commented Mar 1, 2021

Also, the default discovery configuration parameters are not set in stone, and we might want to tweak them for polkadot's needs.

@eskimor
Member

eskimor commented Mar 1, 2021

It retains still-active authorities in the cache and also issues queries for them, which happens continuously for validators of the current and the next session. So the cache will be filled with authorities from the next session throughout the current session; I would expect the cache to be nicely populated by the time we start requesting authorities on session change, as it has already been filled during the previous session (when we did not care about them yet). This is true except maybe on startup; not sure if that is an issue though, as a node will have some ramp-up time anyway?

To me it looks like the current implementation is totally fine. The cache represents our view of discoverable validators; sure, that view can change over time, but I believe we can treat any particular query as the current status quo and be good with it.

@eskimor
Member

eskimor commented Mar 1, 2021

> Discovery is happening continuously by listening to the DHT events. If we happen to query it at some point at it returns None here, we won't query it again until the same validator is requested again. However, a discovery might happen after this event or are you saying it can't?

It could, but I would argue that it is pretty unlikely and we should not need to care. The cache should make sure we don't need to, and I think it does. There can be network issues of course, but we could also have them when connecting to the validator; that's just something we need to be able to deal with, and I think we do, as we have redundancy in all our protocols.

From an attack surface perspective: If an adversary was able to prevent discovery for a whole session, it could very likely prevent it for another session as well. So re-doing failed lookups would not help here either.

@ordian
Member Author

ordian commented Mar 1, 2021

Alright, so our assumptions are:

  1. We don't issue connection requests to the validators of the next session.
  2. Authority discovery fills up the cache for the next session before it begins. Except on startup (we also care about a few past sessions), but we allow some time for bootstrap.

If these assumptions are to hold in the future, then I agree we don't need an additional polling mechanism.

@eskimor
Member

eskimor commented Mar 1, 2021

I guess 1 can be weakened a bit: if we only start requesting connections to validators of the next session towards the end of the previous one (to make transitions smoother), we should still be golden.

@parity2305
Contributor

parity2305 commented Mar 1, 2021

Hmm. Agreed. Also, with the merge of #2494 we are now keeping track of the past N sessions in authority discovery, so as per #1461 (comment) every networking subsystem should be able to connect to relevant validators using the authority-discovery worker's cache.

I will close my draft PR.

@ordian
Member Author

ordian commented Mar 1, 2021

Thanks. P. 2 was implemented in #2494, and we agreed that p. 1 and p. 3 are not worth pursuing, so I'm closing this issue.

@ordian ordian closed this as completed Mar 1, 2021

3 participants