
Cryptographic authentication of validators' session keys in network #47

Closed
rphmeier opened this issue Nov 27, 2018 · 11 comments
Labels
I2-security The node fails to follow expected, security-sensitive, behaviour.

Comments

@rphmeier
Contributor

Currently, we just believe someone if they say they control a specific session key. We will need a cryptographic authentication mechanism for this.

paritytech/substrate#271

@rphmeier rphmeier added the I2-security The node fails to follow expected, security-sensitive, behaviour. label Nov 27, 2018
@rphmeier rphmeier added this to the PoC-4 (Interchain Communication) milestone Nov 27, 2018
@burdges
Contributor

burdges commented Apr 7, 2019

We might make this roughly

pub struct SessionKeys {
    controller: [u8; 32],
    babe: [u8; 32],
    grandpa: [u8; 48],
    // Ed25519 key for TLS 1.3.  Noise converts to X25519.
    transport: [u8; 32],   
    session_id: ??,
}
pub struct SessionCertificate {
    keys: SessionKeys,
    /// Sr25519 signature over self.keys
    babe: [u8; 64],
    /// BLS signature over self.{keys, babe}
    grandpa: [u8; 96],
    /// Sr25519 or Ed25519 signature over self.{keys, babe, grandpa}
    controller: [u8; 64],
}

We do not require this proof-of-possession for BABE currently, but it is probably good anyway. We absolutely need the self-signature by GRANDPA for proof-of-possession. I do not see any particular reason for either BABE or GRANDPA to sign the other's signature as described above, but we could do one or the other if we can think of any benefit.

We could make the transport key flexible with a tiny Merkle tree, or hide it by making it a hash. We might ask whether TLS prefers some standard certificate format, except that TLS does not like our key formats, so there is probably no benefit.
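
For concreteness, verification of those nested signatures might look roughly like the sketch below. The sr25519_verify, bls_verify, and encode helpers are hypothetical stand-ins for a real Sr25519 library, a real BLS library, and a SCALE-style encoding, not any particular crate's API.

// Hypothetical stand-ins; any concrete Sr25519/BLS libraries would do.
fn sr25519_verify(public: &[u8; 32], msg: &[u8], sig: &[u8; 64]) -> bool { unimplemented!() }
fn bls_verify(public: &[u8; 48], msg: &[u8], sig: &[u8; 96]) -> bool { unimplemented!() }
fn encode(keys: &SessionKeys) -> Vec<u8> { unimplemented!() } // hypothetical SCALE-style encoding

impl SessionCertificate {
    /// Check every signature, establishing proof-of-possession for the
    /// BABE and GRANDPA keys and binding both to the controller key.
    fn verify(&self) -> bool {
        let mut msg = encode(&self.keys);
        // BABE proof-of-possession: Sr25519 signature over self.keys.
        if !sr25519_verify(&self.keys.babe, &msg, &self.babe) { return false; }
        // GRANDPA proof-of-possession: BLS signature over self.{keys, babe}.
        msg.extend_from_slice(&self.babe);
        if !bls_verify(&self.keys.grandpa, &msg, &self.grandpa) { return false; }
        // Controller endorsement over everything above.
        msg.extend_from_slice(&self.grandpa);
        sr25519_verify(&self.keys.controller, &msg, &self.controller)
    }
}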

@burdges
Contributor

burdges commented Jun 5, 2019

We should list out the obvious concerns:

  1. Proofs-of-possession: We need proofs-of-possession for GRANDPA's BLS keys, but they come up elsewhere, so maybe do them for all key types because doing so looks almost free.
  2. Replay attacks: If neither BABE nor GRANDPA sign the controller account's key, then a controller account can name another controller's session key. We can prevent this with careful importing, but maybe just enforcing it here makes more sense.
  3. Forward security: We could rotate GRANDPA keys more frequently than BABE keys. If only the BABE key should be certified by the controller, and the BABE key should certify the GRANDPA key, then validators could rotate GRANDPA keys without touching their controller key.
  4. Epochs: We might complicate epoch structure if GRANDPA keys are only certified indirectly like in the forward secure variant (3).
  5. Slashing: BABE keys cannot incur too much slashing, but GRANDPA keys can incur lots, and the forward secure variant (3) makes BABE keys riskier.
  6. Avoid innocent equivocations: We ideally want some key that exists only in memory but is registered on-chain, so that validators always register it when starting up and can shut down if the current key changes out from under them.

I myself think 4 and 5 overrule 3 right now, since controller account private keys must be easily accessible, even if technically air-gapped. If we want really strong forward security then Pixel sounds cool, but it increases GRANDPA verification costs by 50% and increases GRANDPA signing costs dramatically. If we want weaker forward security then we could later alter the certificate format, but doing so now complicates implementation slightly.

We should not use BABE keys for 6 because they need to be registered well in advance. We could use GRANDPA keys, but GRANDPA keys still incur some changeover costs. We suggest using transport keys for this. I therefore moved the transport_public from SessionKeys to SessionCertificate.

We'd thus end up with something like:

pub struct SessionKeys {
    /// Controller account public key responsible for this validator, either Ed25519 or Sr25519.
    controller_public: [u8; 32],
    /// Sr25519 public key for BABE
    babe_public: [u8; 32],
    /// BLS public key for GRANDPA
    grandpa_public: [u8; 48],
}
pub struct SessionCertificate {
    keys: SessionKeys,

    /// BLS signature over self.keys
    grandpa_cert: [u8; 96],
    /// Sr25519 or Ed25519 signature over self.{keys, grandpa_cert}
    controller_cert: [u8; 64],

    /// Ed25519 key for TLS 1.3.  Noise converts to X25519.
    transport_public: [u8; 32],   

    /// Sr25519 signature over self.{keys, grandpa_cert, controller_cert, transport_public}
    babe_cert: [u8; 64],
}
pub struct SessionPrivate {
    public_priv: SessionCertificate,

    /// Sr25519 private key
    babe_priv: [u8; 32],
    /// BLS private key
    grandpa_priv: [u8; 32],

    // Ed25519/X25519 private key seed for TLS 1.3 or Noise.
    transport_priv: [u8; 32],
}

You're free to split the private keys in SessionPrivate up separately.
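
For concreteness, construction would then proceed in dependency order, roughly as below; the signing and encoding helpers are hypothetical stand-ins, not a particular library's API.

// Hypothetical stand-ins for real Sr25519/BLS signing and SCALE-style encoding.
fn sr25519_sign(secret: &[u8; 32], msg: &[u8]) -> [u8; 64] { unimplemented!() }
fn bls_sign(secret: &[u8; 32], msg: &[u8]) -> [u8; 96] { unimplemented!() }
fn encode(keys: &SessionKeys) -> Vec<u8> { unimplemented!() }

/// Sign in dependency order: GRANDPA certifies the keys, the controller
/// endorses that, and BABE signs last over everything, including the
/// transport key, so the transport key can rotate with only a fresh babe_cert.
fn certify(
    keys: SessionKeys,
    babe_secret: &[u8; 32],
    grandpa_secret: &[u8; 32],
    controller_secret: &[u8; 32],
    transport_public: [u8; 32],
) -> SessionCertificate {
    let mut msg = encode(&keys);
    let grandpa_cert = bls_sign(grandpa_secret, &msg);
    msg.extend_from_slice(&grandpa_cert);
    let controller_cert = sr25519_sign(controller_secret, &msg);
    msg.extend_from_slice(&controller_cert);
    msg.extend_from_slice(&transport_public);
    let babe_cert = sr25519_sign(babe_secret, &msg);
    SessionCertificate { keys, grandpa_cert, controller_cert, transport_public, babe_cert }
}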

@burdges
Contributor

burdges commented Jun 22, 2019

In polkadot, the important constraints boil down to:

BABE Sr25519 keys should not be used in BABE's VRF, i.e. for block production, until a full BABE epoch has elapsed, meaning that if a key is registered before epoch i started then the key becomes usable in epoch i+1. We've discussed a notion of full and mini BABE epochs, in which the randomness cycle operates in mini epochs but the security analysis happens in full epochs, btw.
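
In other words (a minimal sketch; the function name is illustrative):

/// A key registered during epoch i-1, i.e. before epoch i started, must
/// sit out all of epoch i, so its VRF first becomes usable in epoch i+1.
fn first_epoch_usable_for_vrf(registered_during: u64) -> u64 {
    registered_during + 2
}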

It's fine if BABE Sr25519 keys and GRANDPA BLS keys are used for signing messages before a full BABE epoch has elapsed, however.

As a result, we could permit validator operators to "immediately" deploy a new session key, which halts their block production until a new BABE epoch elapses, but permits them to continue with GRANDPA. This avoids scenarios in which validator operators must choose between (a) some risk that their key was compromised, and (b) the risk of being slashed for being offline. It also permits validator nodes to migrate hardware, data centers, etc. without transferring session keys, which reduces validator operator error, at a small financial loss from block production.

We should still support session keys being placed into a "revoked" state, in case the validator operator cannot spin up a new validator node quickly enough.

GRANDPA BLS keys should not be used by any node unless that node has first checked the proof-of-possession. Nodes should probably only check the proof-of-possession for each GRANDPA BLS key once, which means some local runtime state.
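
A sketch of that local state, with check_pop as a hypothetical stand-in for real BLS proof-of-possession verification:

use std::collections::HashSet;

/// Remember which GRANDPA BLS keys already passed the proof-of-possession
/// check, so each key is verified at most once per node.
struct PopCache {
    verified: HashSet<[u8; 48]>,
}

impl PopCache {
    fn is_valid(
        &mut self,
        grandpa_public: [u8; 48],
        check_pop: impl Fn(&[u8; 48]) -> bool, // hypothetical BLS PoP check
    ) -> bool {
        if self.verified.contains(&grandpa_public) {
            return true; // already verified once; skip the expensive pairing check
        }
        let ok = check_pop(&grandpa_public);
        if ok {
            self.verified.insert(grandpa_public);
        }
        ok
    }
}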

It might simplify the code considerably if session keys exist prior to validator election. We caution, however, that election must not interact with revocation and immediate key-update transactions in ways that permit GRANDPA keys to bypass the proof-of-possession check, or permit BABE keys to be used too early for block production.

As above, we still favor the GRANDPA BLS and BABE Sr25519 secret keys being written to disk so that validator nodes can go down and come back up without involving the controller key. We recommend that the transport Ed25519 secret key never be saved to disk, and instead be created and certified by the BABE key on node startup. If this transport Ed25519 key gets pushed to the chain on startup, then we substantially reduce the scenarios in which validator operator error results in slashable equivocation.
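
A sketch of that startup flow; all three helpers are hypothetical stand-ins:

fn ed25519_generate_in_memory() -> ([u8; 32], [u8; 32]) { unimplemented!() } // (secret, public)
fn sr25519_sign(babe_secret: &[u8; 32], msg: &[u8]) -> [u8; 64] { unimplemented!() }
fn push_transport_key_to_chain(transport_public: [u8; 32], babe_cert: [u8; 64]) { unimplemented!() }

/// On startup: create the transport key in memory only, certify it with
/// the on-disk BABE key, and register it on-chain. If the registered key
/// ever differs from the local one, some other node holds our session
/// keys, so we shut down rather than equivocate.
fn register_fresh_transport_key(babe_secret: &[u8; 32]) -> ([u8; 32], [u8; 32]) {
    let (transport_secret, transport_public) = ed25519_generate_in_memory();
    let babe_cert = sr25519_sign(babe_secret, &transport_public);
    push_transport_key_to_chain(transport_public, babe_cert);
    (transport_secret, transport_public) // secret stays in memory, never on disk
}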

Any key changes should receive finality in GRANDPA before taking effect.

@rphmeier
Contributor Author

rphmeier commented Feb 10, 2020

Now that #788 is in, we have a bit more of a launching pad to do some kind of AEAD channel with peers who claim ownership of keys.

@mxinden told me that the authority-discovery module is not fast enough for quick authentication of which peers we are connected to that are currently validators.

There are certain messages that we only want to process if we're sure that they came from a particular validator/collator. On the flip side, there are also sometimes requests that we want to make only to certain validators, and the only way to know for sure is to do authentication.

Schnorrkel added some AEAD functions we can use, probably in conjunction with AES256-GCM, which Ring supports.
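
As a hedged sketch of the shape this could take: a hypothetical agree_shared_secret standing in for the schnorrkel key agreement, with ring's AES-256-GCM doing the sealing. This assumes the ring crate; nonce management is elided.

use ring::aead::{Aad, LessSafeKey, Nonce, UnboundKey, AES_256_GCM};

/// Hypothetical: derive a 32-byte shared secret from our session secret
/// and the peer's session public key via schnorrkel.
fn agree_shared_secret(our_secret: &[u8; 32], their_public: &[u8; 32]) -> [u8; 32] {
    unimplemented!()
}

/// Seal one message to a peer identified by their session key; the nonce
/// must be unique per (key, message) pair.
fn seal_message(
    our_secret: &[u8; 32],
    their_public: &[u8; 32],
    nonce: [u8; 12],
    mut msg: Vec<u8>,
) -> Vec<u8> {
    let shared = agree_shared_secret(our_secret, their_public);
    let key = LessSafeKey::new(UnboundKey::new(&AES_256_GCM, &shared).expect("32-byte key"));
    key.seal_in_place_append_tag(Nonce::assume_unique_for_key(nonce), Aad::empty(), &mut msg)
        .expect("sealing in memory cannot fail");
    msg // ciphertext with the GCM tag appended
}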

I'm not sure how the design here should look. I'd like to avoid emulating transports or protocols as much as possible, but opening these as a separate substream somehow seems like it may be overkill, and also difficult to do in Substrate's network model.

However, given that this involves doing a key agreement with a schnorrkel session key (or collator key), it seems difficult to bring sentry nodes into the middle, yet we'd want to do this in a way where sentries can act as simple middlemen.

@burdges
Contributor

burdges commented Feb 10, 2020

As I understand it, this issue is about certificates, not encryption. Authenticated encryption (AEAD) does not provide the same sort of authentication as certificates, aka signatures by one key on another key. We already have certificates on session keys by controller keys, right?

AEADs take a symmetric key and show the data was encrypted with that symmetric key, which does many useful things, but AEADs only show that the symmetric key encrypted the message. Anyone who can read the message can also forge a different message.

We could use AEADs to make gossip work kinda-like direct connections for Sassafras' pre-announce phase.

@burdges
Contributor

burdges commented Feb 10, 2020

@mxinden told me that the authority-discovery module is not fast enough for quick authentication of which peers we are connected to that are currently validators.

Any idea why?

@rphmeier
Contributor Author

rphmeier commented Feb 11, 2020

DHT queries are too expensive to write code like is_validator_with_id_in_x(peer_id, x). Collators, for example, want to find all the peers they're connected to who are validators assigned to particular parachains.

Maybe authority-discovery could cache this, but it gets a little funky around session-key changes. I'm open to any solutions that don't require AEAD.
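
One possible shape for such a cache, invalidated whenever the session index changes; the types here are illustrative, not the real authority-discovery API:

use std::collections::HashMap;

type AuthorityId = [u8; 32]; // illustrative
type PeerId = Vec<u8>;       // illustrative

struct AuthorityCache {
    session_index: u32,
    peers: HashMap<AuthorityId, PeerId>,
}

impl AuthorityCache {
    /// Answer is_validator_with_id_in_x-style queries from the cache,
    /// dropping it whenever the session (and thus the key set) changes.
    fn lookup(&mut self, current_session: u32, authority: &AuthorityId) -> Option<&PeerId> {
        if self.session_index != current_session {
            self.peers.clear(); // stale session keys: refill from the DHT lazily
            self.session_index = current_session;
        }
        self.peers.get(authority)
    }
}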

As I understand it, this issue is about certificates, not encryption

That seems to be what you've been talking about, but it's not what I opened the issue about :) - the purpose of this issue was always to find a way to ensure that nodes have a way to know for sure if they are talking to a validator or collator with a particular key and that they are not being MITMed by another node who is talking to that actor. AEAD based on a key exchange is one way of accomplishing that, which is encryption. The other way is certificates for PeerIDs stored on the DHT.

@burdges
Contributor

burdges commented Feb 11, 2020

I see! :) I'd prefer if encryption primarily happened at the transport level by correctly using Noise or TLS 1.3 with QUIC, mostly just because these transport level libraries already work hard at making this stuff relatively easy to use correctly.

At the transport level, I'd expect the core issue to be that validators have sentry nodes, so the validator should certify all sentry node keys somehow. I'd previously suggested that transport level keys be certified by some consensus key, not the controller key, but actually we never thought through exactly what infrastructure people may want for sentry nodes. Is there a need to add or remove sentry nodes automatically? What are the risks? How does revocation of sentry nodes work?

We should have some conversation about when validators need encryption above the transport level too. If direct connections exist then Sassafras does not necessarily require encryption, thanks to being only one hop, but it may be easier with encryption. I'm not averse to Sassafras using second-layer encryption. We might also want this simply for when messages should not be seen by sentry nodes, but I am not sure what fits that. What else needs encryption?

cc @infinity0

@rphmeier
Contributor Author

Sentry nodes definitely make this harder. That's what I wanted to loop @mxinden in for, and I also spoke with @tomaka a bit about libp2p APIs yesterday.

I guess what would happen is that we'd have communication happening over Noise/QUIC, but then we'd layer another substream on top of that which uses AEAD, and which might also be relayed if we're passing through a sentry node.

@rphmeier
Contributor Author

Talked with @mxinden and @tomaka today.

There are a couple of different techniques that are useful in different situations.

  1. Fetching and caching DHT records from the most recent set of SessionKeys. This uses the AuthorityDiscovery module. These DHT requests are slow, but suitable for discovery. An automatic fetch would be done every ~10 minutes as I understand it.
  2. Targeted discovery requests via AuthorityDiscovery, for when we want to connect to a set of nodes ahead-of-time. This would be useful as a collator or validator to attempt to connect to nodes which are going to be assigned to the same parachain in the near future. Also useful as a fisherman, when requesting pieces of historical data that are meant to be available (PoVBlock and incoming messages).
  3. Peers send each other signed messages of the form (MyPeerId, MySessionKey), signed or certified with MySessionKey. When communicating via a sentry node, MyPeerId should be the sentry node's PeerId; otherwise, it should be the validator node's. Nodes would send this whenever they activate a new session key, and would keep track of the last 3 (some small constant) authenticated keys from other peers. This enables the protocol handler to maintain a bounded reverse mapping from ValidatorId -> PeerId for connected nodes, which is fast to query. Maybe a ValidatorId -> Set<PeerId> for cases where we are connected to multiple sentries of the same validator.
  4. Put PeerIds on-chain. We'd put this in with the SessionKeys, with a limit of 50-100 or so. It is ambiguous as to which relay-chain state to reference when querying this, as it could potentially differ across forks. Although that is a somewhat degenerate situation that is most likely to occur at session boundaries, I am uneasy writing code that just shrugs and says "probably fine!".

Options 1 and 2 have a couple of caveats: there aren't good ways to prove that the owner of a specific session key is the same as the owner of a historical session key. This makes forwarding trust from a previously authenticated connection to their new ID more difficult. We also can't detect if an incoming peer is a validator without making a (potentially slow) query to AuthorityDiscovery. We would have to do this for every peer. We probably want a way for incoming connections to tell us if they are a validator and prove it. Which brings us to options 3 and 4: incoming connections can easily tell us if they are a validator by signing off on their peer ID, or we can look up their PeerId in the chain state.
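
For concreteness, the option-3 message might look like the sketch below; the signing helpers are hypothetical stand-ins for the real session-key scheme.

type PeerId = Vec<u8>;       // illustrative
type SessionKey = [u8; 32];  // illustrative

// Hypothetical stand-ins for the session key's signature scheme.
fn sr25519_sign(secret: &[u8; 32], msg: &[u8]) -> [u8; 64] { unimplemented!() }
fn sr25519_verify(public: &SessionKey, msg: &[u8], sig: &[u8; 64]) -> bool { unimplemented!() }

/// "(MyPeerId, MySessionKey) signed with MySessionKey": the session key
/// vouches that its owner is reachable at this PeerId, which may belong
/// to a sentry node.
struct PeerAuthentication {
    peer_id: PeerId,
    session_key: SessionKey,
    signature: [u8; 64],
}

impl PeerAuthentication {
    fn sign(peer_id: PeerId, session_key: SessionKey, session_secret: &[u8; 32]) -> Self {
        let mut msg = peer_id.clone();
        msg.extend_from_slice(&session_key);
        let signature = sr25519_sign(session_secret, &msg);
        PeerAuthentication { peer_id, session_key, signature }
    }

    /// On receipt: check the signature, then update the bounded
    /// ValidatorId -> Set<PeerId> reverse mapping.
    fn verify(&self) -> bool {
        let mut msg = self.peer_id.clone();
        msg.extend_from_slice(&self.session_key);
        sr25519_verify(&self.session_key, &msg, &self.signature)
    }
}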


I'm most in favor of using options 1 & 2 in conjunction to build good discovery APIs, along with option 3 to keep live connections updated. Option 4 seems difficult to implement and also requires us to know which session a peer is on, so I would prefer to avoid it.

@burdges
Contributor

burdges commented Feb 12, 2020

We can layer a whole second Noise session on top if these are not one-off messages. My little AEAD model is more targeted at one-off messages.

  1. Fetching and caching DHT records from the most recent set of SessionKeys. This uses the AuthorityDiscovery module. These DHT requests are slow, but suitable for discovery. An automatic fetch would be done every ~10 minutes as I understand it.

I doubt encrypting DHT records helps anything. You just want the DHT connection encrypted for some reason?

Or are you worried about authenticating DHT data? Assuming yes, and that you want a DHT solution...

If I understand, we're pulling on-chain information from the DHT so that either we can figure out whom to talk to to sync the chain, or else we continue as a light client. In the second case, we need to trust the public claims in the DHT somehow. Ideally, we'd want each validator set to sign off on the next validator set, which they do in grandpa, but that requires reading a whole block. We could maybe put some validator set change information in the block header, so that a light client could pull only the specific between-epoch relay chain block headers that track the validator set changes, each carrying grandpa signatures from the current validator set on the next.
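
A sketch of that light-client walk, under the assumption (not current behaviour) that between-epoch headers carry the next validator set plus a grandpa justification by the current one:

/// Illustrative structure for the suggested between-epoch headers.
struct EpochHeader {
    epoch: u64,
    next_validator_set: Vec<[u8; 48]>,
    justification: Vec<u8>, // grandpa signatures by the current set
}

/// Hypothetical: verify the justification against a known validator set.
fn grandpa_verify(set: &[[u8; 48]], header: &EpochHeader) -> bool { unimplemented!() }

/// Walk from a trusted validator set to the current one, reading one
/// header per epoch instead of whole blocks.
fn sync_validator_sets(mut trusted: Vec<[u8; 48]>, headers: &[EpochHeader]) -> Option<Vec<[u8; 48]>> {
    for header in headers {
        if !grandpa_verify(&trusted, header) {
            return None; // broken handoff: abort rather than trust the DHT
        }
        trusted = header.next_validator_set.clone();
    }
    Some(trusted)
}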

Sentry nodes can only really use a certificate issued by the validator or controller key (see my previous comment).

  2. Targeted discovery requests via AuthorityDiscovery, for when we want to connect to a set of nodes ahead-of-time. This would be useful as a collator or validator to attempt to connect to nodes which are going to be assigned to the same parachain in the near future. Also useful as a fisherman, when requesting pieces of historical data that are meant to be available (PoVBlock and incoming messages).

I'd expect these to use the transport layer encryption and authentication, no?

Is the issue that you want to open connections before fully authenticating nodes? And lazily obtain certificates for the long-term keys used to authenticate those connections? That's doable, but maybe hard to know you did it right.

  3. Peers send each other signed messages of the form (MyPeerId, MySessionKey), signed or certified with MySessionKey. When communicating via a sentry node, MyPeerId should be the sentry node's PeerId; otherwise, it should be the validator node's. Nodes would send this whenever they activate a new session key, and would keep track of the last 3 (some small constant) authenticated keys from other peers. This enables the protocol handler to maintain a bounded reverse mapping from ValidatorId -> PeerId for connected nodes, which is fast to query. Maybe a ValidatorId -> Set<PeerId> for cases where we are connected to multiple sentries of the same validator.

I've mostly been saying this, yes. And yes, you can cache certificate chain checking.

  4. Put PeerIds on-chain. We'd put this in with the SessionKeys, with a limit of 50-100 or so. It is ambiguous as to which relay-chain state to reference when querying this, as it could potentially differ across forks. Although that is a somewhat degenerate situation that is most likely to occur at session boundaries, I am uneasy writing code that just shrugs and says "probably fine!".

I suppose PeerIds means transport keys? It's not essential to put them on chain, but the certificate chain should be checked when they get used, and that ultimately points all the way back to the chain.

1 and 2 have a couple caveats - there aren't good ways to prove that the owner of a specific session key is the same as the owner of a historical session key. This makes forwarding trust from a previously authenticated connection to their new ID more difficult.

Ahh! I see!

We naively assumed one should never forward trust because the chain is our root of trust. We change epochs and eras all the time however, which creates this problem:

In era e, we had some validator with session key X certified by controller key Z. Yet X's obligations go beyond era e. In era e+1 we have some validator with session key Y, distinct from X, but also certified by controller key Z. Does Y still have the same responsibilities as X?

We might let validators change some session keys more often than once per era, at which point you replace eras in this with epochs or something even shorter.

At first blush, yes, we should mostly give Y the same responsibilities as X, and simply track by the controller key that certifies the validator key, but some responsibilities, like erasure coded pieces, perhaps make this hard. We do not slash for lacking erasure coded pieces though, so maybe new nodes could just make some effort to obtain the most recent ones before their session key became live.

I know grandpa caused some headaches, but I'd expect grandpa rounds to run entirely within a specific session key's lifetime. If Z swaps their session key from X to Y then they should maybe keep the node with X up long enough to answer any old grandpa challenges, or else risk being slashed. There are two forms of this:

  1. If they replace the whole node then maybe they take some risks, but ideally the X node should attempt to run until it can resolve any outstanding issues. I suppose X staying up complicates our situation, since the certificate for Y expires the certificate for X, so this requires some nuance.
  2. If they give the same node a new key for forward security then maybe the node should continue operating for both until the obligations for X are all satisfied. It's messy though, because forward security means erasing the key X, so maybe X should just issue (sign) a co-certificate extending the new certificate on Y from Z that says "Y should really have the same data as X" or something, not sure.

We also can't detect if an incoming peer is a validator without making a (potentially slow) query to AuthorityDiscovery.

We care which validator it is in both AnV and Sassafras, not just that they're some validator.

We would have to do this for every peer. We probably want a way for incoming connections to tell us if they are a validator and prove it. Which brings us to options 3 and 4: incoming connections can easily tell us if they are a validator by signing off on their peer ID, or we can look up their PeerId in the chain state.

These proofs can only really be a certificate chain:

  1. Your connection (TLS 1.3 or Noise) first authenticates itself with the supplied long-term key V[0] (PeerId maybe?). It sets i=0 and P[0] = [TRANSPORT].
  2. Next it checks the certificate on V[i] by some supplied long-term key V[i+1]. Aside from merely signing V[i], the certificate message signed by V[i+1] includes the properties P[i] used by V[i] and claims that V[i+1] has properties P[i+1] allowing it to delegate P[i].
  3. At some point P[i] includes [CONTROLLER] so we've established the node's "identity" for slashing purposes, and we're done if this is the validator we wanted to talk to.

It's much like the X.509 standard used by HTTPS, but terminating in the chain. We rolled our own without the loop because the chain is normally quite short in our case, and we do not care about the same property sets as X.509, but maybe you need an extra layer to deal with transport keys, especially for sentry nodes, not sure (see my previous comment). Would it help if we abstracted certificate handling in some loop like this?
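
Such a loop might look roughly like this; the property tags and verify_link are hypothetical:

#[derive(Clone, Copy, PartialEq)]
enum Property { Transport, Grandpa, Babe, Controller } // illustrative tags

struct CertLink {
    signer: Vec<u8>,           // V[i+1], the key signing the one below it
    properties: Vec<Property>, // P[i+1], properties claimed for the signer
    signature: Vec<u8>,
}

/// Hypothetical: check the signature by link.signer over the signed key
/// and its properties, including the rule that P[i+1] may delegate P[i].
fn verify_link(signed_key: &[u8], signed_props: &[Property], link: &CertLink) -> bool {
    unimplemented!()
}

/// Walk from the transport key authenticated by the connection up to a
/// controller key, which fixes the node's identity for slashing purposes.
fn verify_chain(transport_key: &[u8], chain: &[CertLink]) -> Option<Vec<u8>> {
    let mut key = transport_key.to_vec();
    let mut props = vec![Property::Transport];
    for link in chain {
        if !verify_link(&key, &props, link) {
            return None;
        }
        key = link.signer.clone();
        props = link.properties.clone();
    }
    if props.contains(&Property::Controller) { Some(key) } else { None }
}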

@gavofyork gavofyork removed this from the PoC-4 (Interchain Communication) milestone May 21, 2020
tomusdrw pushed a commit that referenced this issue Mar 26, 2021