Add creation time for DHT authority discovery records #91

Merged

105 additions & 0 deletions — `text/0091-dht-record-creation-time.md`
# RFC-0091: DHT Authority discovery record creation time

| | |
| --------------- | ------------------------------------------------------------------------------------------- |
| **Start Date** | 2024-05-20 |
| **Description** | Add creation time for DHT authority discovery records |
| **Authors** | Alex Gheorghe (alexggh) |

## Summary

Extend the DHT authority discovery records with a signed creation time, so that nodes can determine which record is newer and always prefer the newer record over the old one.

## Motivation

Currently, we use the Kademlia DHT for storing records that map an authority discovery key to the p2p address of a node. The problem is that when a node changes its PeerId/network key it publishes a new record; however, because of the distributed and replicated nature of the DHT, there is no way to tell which record is newer, so both the old PeerId and the new PeerId live in the network until the old record expires (36h). That creates all sorts of problems and leads to a node that changed its address not being properly connected for up to 36h.
---

**Contributor:**

> The problem is that when a node changes its PeerId/network key it publishes a new record; however, because of the distributed and replicated nature of the DHT, there is no way to tell which record is newer, so both the old PeerId and the new PeerId live in the network until the old record expires (36h).

I don't think that's true? The new record is very much supposed to overwrite the old one.

**Contributor (@tomaka), May 20, 2024:**

"Putting a record" consists of finding the 20 peers whose PeerIds are closest to the relevant key (here, the authority discovery key) and sending them the record. If you later want to modify the record, you find these 20 peers again (they are normally still the same) and send them a new record, which overwrites the old one.

**Contributor Author:**

> I don't think that's true? The new record is very much supposed to overwrite the old one.

I've tested this several times on different network configurations, and the old record still survives in the network until it expires. As far as I can tell, this happens because you initially publish the record to the closest 20 nodes, but when the record reaches other nodes for various reasons, they will still store it locally according to this formula: https://github.com/libp2p/rust-libp2p/blob/ad7ad5b3fc5b4bc9a431ece90e9a5ce8c33ca0e2/protocols/kad/src/behaviour.rs#L1796, so not just the initial 20 nodes end up storing the record.

**Contributor (@tomaka), May 20, 2024:**

> when the record reaches other nodes for various reasons

Well, this isn't really supposed to happen. It can happen in some rare cases where the publisher has been lied to during its iterative query, for example.

Even if the record reaches nodes other than the initial 20 closest nodes, reading the record (i.e. querying it) is always done against the 20 closest nodes.

**Contributor Author:**

Alright, I double-checked again. Nodes don't actually have perfect connectivity, so their views of the closest peer ids are not always the same, especially after a restart. We therefore end up in situations where we don't publish to the same 20 closest nodes; it takes just one node still holding the old record, which then replicates it to its own view of the 20 closest nodes, for the old record to override the new one.

The issue can be reproduced systematically by running a network of around 30-40 nodes: let it run for some time, then restart one of the nodes with a different PeerId.

**Contributor Author:**

So, the exact scenario I'm testing is this:

  1. Run a stable polkadot network.
  2. Restart a single node with a new network key (PeerId), which makes the node publish a new value on the DHT for the key it is responsible for.

What I'm observing is that the old record of the single node actually lives on more than the 20 closest nodes that get updated at step 2, and because those nodes regularly perform replication (https://github.com/libp2p/rust-libp2p/blob/ad7ad5b3fc5b4bc9a431ece90e9a5ce8c33ca0e2/protocols/kad/src/behaviour.rs#L2520), the old record overwrites the newer one.

I consider this a valid use case, since it is what happens when nodes upgrade and forget to persist their network key; it has happened a few times in the past on kusama.

For the tests I ran, the network does not seem to be split, since I see everyone communicating with everyone on the other protocols. So what do you think I am missing here?

**Contributor:**

This whole discussion is off-topic for this RFC, but the record could be on more than 20 nodes simply because during step 1 you have temporary net splits. What's important is that the 20 nodes where the record is updated at step 2 are the same as the ones queried by the rest of the network when they want to read the record. The logs would tell you that.

**Contributor Author:**

> What's important is that the 20 nodes where the record is updated at step 2 are the same as the ones queried by the rest of the network when they want to read the record. The logs would tell you that.

Right, but because the one node (outside of the closest 20) still has the old record and replicates it to the closest 20, all of them will end up with the old record.

> This whole discussion is off-topic for this RFC

It is the reason I started the RFC in the first place, but I agree it shouldn't affect the RFC. I'll go ahead and apply your suggestion about using a single signature.

**Contributor (@tomaka), May 24, 2024:**

> Right, but because the one node (outside of the closest 20) still has the old record and replicates it to the closest 20, all of them will end up with the old record.

As far as I can remember, only the original publisher periodically re-publishes, and only it can. When receiving a record, a node ensures that the PeerId of the sender of the record (i.e. the "live" PeerId of the connection where the record is being received) can be found in the list of PeerIds that is contained in the record and covered by the signature. Otherwise you would have a pretty obvious vulnerability, and we thought of these kinds of things when designing the system.

> It is the reason I started the RFC in the first place, but I agree it shouldn't affect the RFC. I'll go ahead and apply your suggestion about using a single signature.

I see this RFC as a hack to work around a bug somewhere, and it would be a better idea to actually fix the bug. However, I'm personally not really involved in Polkadot anymore, so I'm not going to vote against.

**Contributor Author (@alexggh), Jun 25, 2024:**

Back to this.

> As far as I can remember, only the original publisher periodically re-publishes, and only it can. When receiving a record, a node ensures that the PeerId of the sender of the record (i.e. the "live" PeerId of the connection where the record is being received) can be found in the list of PeerIds that is contained in the record and covered by the signature. Otherwise you would have a pretty obvious vulnerability, and we thought of these kinds of things when designing the system.

Yes, you are correct: re-publishing is done only by the original publisher. But there is also replication, which is done by all peers that store the record; the integrity of the record is guaranteed by the fact that the signature is checked, so you can't replicate arbitrary data, you have to replicate the original record.

> I see this RFC as a hack to work around a bug somewhere, and it would be a better idea to actually fix the bug.

I've investigated this and couldn't find anything wrong with the way it works. The scenario this RFC fixes is simply a consequence of the fact that not all nodes have the same routing table at all times, so at different points in time the 20 closest nodes can differ slightly.

> However, I'm personally not really involved in Polkadot anymore, so I'm not going to vote against.

Then, since you don't strongly oppose this RFC, I intend to move forward with adding the new field (`creation_time`), since I see benefits in having it: we can use it to always pick the newest record.

---


With this RFC, nodes are extended to keep the newer record and propagate it to nodes that still store the old record, so in the end all nodes converge to the new record much faster (in the order of minutes, not 36h).

Implementation of the RFC: https://github.com/paritytech/polkadot-sdk/pull/3786.

Current issue without this enhancement: https://github.com/paritytech/polkadot-sdk/issues/3673

## Stakeholders

Polkadot node developers.

## Explanation

This RFC heavily relies on the functionalities of the Kademlia DHT already in use by Polkadot.
You can find the specification [here](https://github.com/libp2p/specs/tree/master/kad-dht).

In a nutshell, on a given node the current authority-discovery protocol publishes Kademlia DHT records at startup and periodically thereafter. The records contain the full address of the node for each authority key it owns. The node also tries to find the full addresses of all authorities in the network by querying the DHT and picking the first record it finds for each of the authority ids it found on chain.

The authority discovery DHT records use the protobuf protocol, and the current format is specified [here](https://github.com/paritytech/polkadot-sdk/blob/313fe0f9a277f27a4228634f0fb15a1c3fa21271/substrate/client/authority-discovery/src/worker/schema/dht-v2.proto#L4). This RFC proposes extending the schema in a backwards-compatible manner by adding a new optional `creation_time` field to `AuthorityRecord`; nodes can use this information to determine which of the records is newer.

Diff of `dht-v3.proto` vs `dht-v2.proto`

```diff
@@ -1,10 +1,12 @@
 syntax = "proto3";

-package authority_discovery_v2;
+package authority_discovery_v3;

 // First we need to serialize the addresses in order to be able to sign them.
 message AuthorityRecord {
   repeated bytes addresses = 1;
+  // Time since UNIX_EPOCH in nanoseconds, scale encoded
+  TimestampInfo creation_time = 2;
 }

 message PeerSignature {
@@ -13,11 +15,17 @@
   bytes public_key = 2;
 }

+// Information regarding the creation date of the record
+message TimestampInfo {
+  // Time since UNIX_EPOCH in nanoseconds, scale encoded
+  bytes timestamp = 1;
+}
+
```
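
The `timestamp` field holds a SCALE-encoded nanosecond count. As a minimal sketch of producing it (the helper name is hypothetical and the exact polkadot-sdk code may differ; this assumes the value is encoded as a `u128` with `parity-scale-codec`):

```rust
use parity_scale_codec::Encode; // crate: parity-scale-codec
use std::time::{SystemTime, UNIX_EPOCH};

// Hypothetical helper producing the `timestamp` bytes of `TimestampInfo`.
fn creation_timestamp_bytes() -> Vec<u8> {
    let nanos: u128 = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .expect("system time is after UNIX_EPOCH")
        .as_nanos();
    // SCALE encodes a u128 as its fixed 16-byte little-endian representation.
    nanos.encode()
}
```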

Each time a node wants to resolve an authority id, it will issue a query with a certain redundancy factor, and from all the results it receives it will pick only the newest record. Additionally, the nodes that answer with old records will be updated with the newer record.
---

**Contributor:**

> Additionally, the nodes that answer with old records will be updated with the newer record.

So, the node that sends the DHT request will inform these nodes?

Strictly speaking, I don't think that this needs to be part of the RFC, because even a node not doing this would behave well.

**Contributor Author:**

Yes, that's correct. I was just trying to explain how a well-behaving validator should/could use this field; I can explicitly state that this part is optional, or remove it.

---
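
As a minimal sketch of the selection rule described above (the type and function names are hypothetical; the actual implementation lives in the polkadot-sdk PR linked earlier), a resolver collects the responses up to the redundancy factor and keeps the record with the greatest creation time, treating records without one as the oldest:

```rust
/// Simplified stand-in for a decoded, signature-checked DHT record.
struct AuthorityDhtRecord {
    addresses: Vec<Vec<u8>>,
    /// Nanoseconds since UNIX_EPOCH; `None` for records from old nodes.
    creation_time: Option<u128>,
}

/// From all records received for one authority id, keep only the newest;
/// records without a timestamp (old protocol) are treated as the oldest.
fn pick_newest(records: Vec<AuthorityDhtRecord>) -> Option<AuthorityDhtRecord> {
    records
        .into_iter()
        .max_by_key(|record| record.creation_time.unwrap_or(0))
}
```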



## Drawbacks

In theory the new protocol creates a bit more traffic on the DHT network, because it waits for DHT records to be received from more than one node, while the current implementation just takes the first record it receives and cancels all in-flight requests to other peers. However, because the redundancy factor is relatively small and this operation happens rarely (every 10 minutes), this cost is negligible.

## Testing, Security, and Privacy


This RFC's implementation (https://github.com/paritytech/polkadot-sdk/pull/3786) has been tested on various local test networks and on Versi.

With regard to security, the creation time is wrapped inside `SignedAuthorityRecord`, so it is signed with the authority id key; hence there is no way for malicious nodes to manipulate this field without the receiving node noticing.
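
A sketch of why tampering is detectable (helper naming is hypothetical; this assumes sr25519 authority discovery keys, as used in Polkadot): the signature must verify against the exact serialized record bytes, which now include the creation time, so any modification of the timestamp breaks verification.

```rust
use sp_core::{crypto::Pair as _, sr25519};

// Hypothetical check: accept a record only if the authority's signature
// covers the serialized `AuthorityRecord` bytes (addresses + creation_time).
fn record_is_authentic(
    serialized_record: &[u8],
    signature: &sr25519::Signature,
    authority_key: &sr25519::Public,
) -> bool {
    sr25519::Pair::verify(signature, serialized_record, authority_key)
}
```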

## Performance, Ergonomics, and Compatibility

Irrelevant.

### Performance

Irrelevant.

### Ergonomics

Irrelevant.

### Compatibility

The changes are backwards compatible with the existing protocol, so nodes running either the old or the new protocol can coexist in the network. This is achieved by using protobuf for serializing and deserializing the records: new fields are ignored when deserializing with the older protocol, and, vice versa, when deserializing an old record with the new protocol the new field will be `None`, and the new code accepts such a record as valid.
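
A sketch of this behaviour, assuming prost-generated Rust types for the schema (the `schema` module name is illustrative):

```rust
use prost::Message;

// `schema` stands for the module generated from dht-v3.proto. prost
// generates message-typed fields as `Option<_>`, so a record published
// by an old (v2) node decodes fine and simply yields
// `creation_time == None`; the new code treats it as valid but oldest.
fn decode_record(bytes: &[u8]) -> Result<schema::AuthorityRecord, prost::DecodeError> {
    schema::AuthorityRecord::decode(bytes)
}
```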

## Prior Art and References

The enhancements have been inspired by the algorithm specified [here](https://github.com/libp2p/specs/blob/master/kad-dht/README.md#value-retrieval).

## Unresolved Questions

N/A

## Future Directions and Related Material

N/A