Skip to content
This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

guide: collator networking & subsystems #1452

Merged
merged 17 commits into from
Jul 31, 2020
Merged

Conversation

rphmeier
Copy link
Contributor

@rphmeier rphmeier commented Jul 22, 2020

Closes #1355

Keeping #1348 in mind, although not yet specifying the features of incremental validation, whitelist/blacklisting, heatmaps, etc. which I think bear further discussion & roll-out plans as well as lower-level design work and interfaces. So this PR is primarily setting up the basic logic and structure the code will have.

TODO:

  • Note validator look-up and discovery that collators use
  • Describe pull protocol for collations
  • Network bridge: expand to introduce a split between underlying network protocols for validators & collators, utilities for peer discovery.

@rphmeier rphmeier added A3-in_progress Pull request is in progress. No review needed at this stage. B0-silent Changes should not be mentioned in any release notes C1-low PR touches the given topic and has a low impact on builders. labels Jul 22, 2020
@rphmeier rphmeier added A0-please_review Pull request needs code review. and removed A3-in_progress Pull request is in progress. No review needed at this stage. labels Jul 24, 2020
@rphmeier rphmeier marked this pull request as ready for review July 24, 2020 01:09
@rphmeier rphmeier requested a review from mxinden July 25, 2020 16:26
- Determine the DHT keys to use for each validator based on the relay-chain state and Runtime API.
- Recover the Peer IDs of the validators from the DHT. There may be more than one peer ID per validator.
- Accumulate all `(ValidatorId, PeerId)` pairs and send on the response channel.
- Feed all Peer IDs to the discovery utility the underlying network provides.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is meant with discovery utility here? Would the line below work as well?

Suggested change
- Feed all Peer IDs to the discovery utility the underlying network provides.
- Add one `PeerId` per validator as a priority group to the `PeerSet`.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess so, but I talked with Pierre and we wanted to do away with this priority groups thing. The previous line seems more general and can be adapted based on what is actually done.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The guide in general leans away from implementation details of Substrate. I believe priority groups are such a detail.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- Feed all Peer IDs to the discovery utility the underlying network provides.
- Feed all Peer IDs to the peer set manager the underlying network provides.

In case we want to keep it generic I would suggest the following. I don't think discovery utility is the right term here as the peer has already been discovered at this point.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addressed in #1535

Comment on lines +36 to +39
/// Request the advertised collation at that relay-parent.
RequestCollation(RequestId, Hash, ParaId),
/// A requested collation.
Collation(RequestId, CandidateReceipt, PoV),
Copy link
Contributor

@tomaka tomaka Jul 31, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should instead be a request-response-style protocol, but I suppose this can be changed later?
As long as collators are honest, it's ok to send collations through notifications.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it can be changed later. As I understand, request/response protocols are not implemented yet.

@rphmeier rphmeier merged commit 0bcb6f9 into master Jul 31, 2020
@rphmeier rphmeier deleted the rh-guide-collator-networking branch July 31, 2020 15:07
ordian added a commit that referenced this pull request Jul 31, 2020
* master:
  guide: collator networking & subsystems (#1452)
  Guide: add a diagram for Inclusion Pipeline & Approval Subsystem (#1457)
  [CI] Build wasm blob with srtool and include prop hashes and blobs in release notes (#1506)
Copy link
Contributor

@infinity0 infinity0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks reasonable, main thing missing is talking to other parachain validators.

/// Peer-sets handled by the network bridge.
enum PeerSet {
/// The collation peer-set is used to distribute collations from collators to validators.
Collation,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps I am interpreting "PeerSet" incorrectly, but with collator networking for validators there are actually two types of neighbours:

Then there is "Passing to the relay chain" which is done on the main relay chain gossip protocol that already exists, with its own set of neighbours that (as you note below) can include non-validators, and also validators on another parachain.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For what's specified, we only need collator<>validator communication. The validator<>validator aspects of distributing whitelists etc. are not handled in this version.

Other aspects of parachain networking (distributing PoVs among parachain group, gossiping statements) are beyond the scope of this subsystem and do indeed use the Validation peerset.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this context, PoV blocks are not supposed to be distributed to the Validation peerset; that is the purpose of the A&V protocol. Here they are only supposed to be distributed to other parachain validators, so that they can sign attestations for them, and is less traffic overall.

If we do not have this component, then parachain collators will need to send the same PoV block to multiple parachain validators, in order to achieve the minimum number of attestations needed for the block production protocol. By having parachain validators also pass this between each other, we alleviate this bottleneck. I think this is fairly important to have even in an early version of the protocol.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, you are using both the terms "collation" and "PoV block" separately. Are you saying they are different things? Because I understood them to be the same thing.

Copy link
Contributor Author

@rphmeier rphmeier Jul 31, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here they are only supposed to be distributed to other parachain validators, so that they can sign attestations for them, and is less traffic overall.

Yes, I understood this. And it is covered by another part of the code, as I mentioned. The flow is Collator->Validator. Validator chooses collation to second (Candidate Selection). Signs an attestation (Candidate Backing) and circulates the PoV to other members of the same group (PoV Distribution). Other members of the group validate and sign attestations (Candidate Backing). Signing an attestation also implies keeping the data available. If the candidate is backed, then Availability Distribution is used to distribute the erasure-coded pieces.

Also, you are using both the terms "collation" and "PoV block" separately. Are you saying they are different things? Because I understood them to be the same thing.

Collation is (CandidateReceipt, PoV)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I understood this. And it is covered by another part of the code, as I mentioned.

OK, what you said makes sense although I am still confused by the docstring on "Validation", it says:

This may include nodes which are not validators, as some protocols on this peer-set are expected to be gossip.

This makes sense for the main gossip network on the relay chain, for GRANDPA/BABE etc. However it does not make sense for PoV distribution - here you are only distributing to other parachain validators, not non-validators nor validators on other parachains.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@infinity0 We can't guarantee at this point that we have transitive connection among the validator set. I guarantee that, in practice, if we excluded full nodes we would probably not achieve parachain liveness.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there plans to guarantee these connections? PoV blocks can get quite large, so gossiping these via non-validators will add a lot of latency - and also non-validator full nodes are untrusted, so this allows them to spam validators that are hoping to receive these objects.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case, the data are authenticated, because they must be presented alongside a Seconded candidate by a validator. Barring validator equivocations, the amount of data is bounded.

## Functionality

## Jobs, if any
The process of generating a collation for a parachain is very parachain-specific. As such, the details of how to do so are left beyond the scope of this description. The subsystem should be implemented as an abstract wrapper, which is aware of this configuration:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are validation/pre-validation functions mentioned anywhere else? If not they could go here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, they would go here. However I've deferred that to a later point.

key: CollatorPair,
collation_producer: Fn(params) -> async (HeadData, Vec<UpwardMessage>, PoV),
}
```
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you intending that collator nodes extend the polkadot client & recompile? It might be easier to have a local parachain-specific process talk to a polkadot node acting as a collator via some interprocess API, but that's a discussion for much later.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you intending that collator nodes extend the polkadot client & recompile

Basically, yes. That's how the current Cumulus architecture works, at least. Some IPC-based split would also be amenable to this approach, with the collation_producer yielding an IPC future.

When acting on an advertisement, we issue a `WireMessage::RequestCollation`. If the request times out, we need to note the collator as being unreliable and reduce its priority relative to other collators. And then make another request - repeat until we get a response or the chain has moved on.

As a validator, once the collation has been fetched some other subsystem will inspect and do deeper validation of the collation. The subsystem will report to this subsystem with a [`CollatorProtocolMessage`][CPM]`::ReportCollator` or `NoteGoodCollation` message. In that case, if we are connected directly to the collator, we apply a cost to the `PeerId` associated with the collator and potentially disconnect or blacklist it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned in my other comment on PeerSet, validators should also be connected to a few other validators on the same parachain, and forward received collations onto them. (The details of neighbour selection are mentioned in #1348)

Other validators are more trusted than collators, so the protocol wire message here only needs to consist of the actual Collation, but if it's simpler to re-use the collator-validator wire message for now, it wouldn't hurt. However in the future once we get onto passing around whitelists/blacklists of collators, the message types would have to diverge.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned in my other comment on PeerSet, validators should also be connected to a few other validators on the same parachain, and forward received collations onto them

Agreed, but this is handled by the PoV distribution and Statement distribution subsystems, not collation distribution.

Messages received by the [Collator Protocol subsystem](../node/collators/collator-protocol.md)

```rust
enum CollatorProtocolMessage {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Protocol" suggests that this message is passed to other nodes, and is somewhat stable, but as I understand this is just an internal message between subsystems and could change whenever. How about CollatorInfoMessage or just CollatorInfo?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The subsystem is called the Collator Protocol subsystem. Our naming convention for these message types is format!("{}Message", subsystem_name)

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
A0-please_review Pull request needs code review. B0-silent Changes should not be mentioned in any release notes C1-low PR touches the given topic and has a low impact on builders.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Guide: Collation Distribution Subsystem
4 participants