
Serialize the peerset state on disk and reload it at startup #565

Open
tomaka opened this issue Jan 28, 2020 · 11 comments
Labels
D1-medium Can be fixed by a coder with good Rust knowledge but little knowledge of the codebase. I5-enhancement An additional feature request.

Comments

tomaka commented Jan 28, 2020

We used to store a nodes.json file on disk containing the known state of the network. This feature was removed when we introduced the peerset manager, but it was always intended to be restored.

It is fundamentally better not to rely too heavily on bootnodes, and instead to try to restore connections to previously working nodes.

This issue is about restoring this feature.

tomaka commented Feb 17, 2020

In the past, we used a network-specific config_path to store the JSON file.
Maybe we could put that in the database instead, in order to reduce the number of files that Substrate/Polkadot maintains? I'm not sure whether that's a good idea.

stale bot commented Jul 7, 2021

Hey, is anyone still working on this? Due to the inactivity this issue has been automatically marked as stale. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the A5-stale label Jul 7, 2021

tomaka commented Jul 8, 2021

Issue still relevant and important.


RGafiyatullin commented Jul 25, 2022

I had a look into the sc-network and sc-peerset crates.

I would like to proceed with the following solution.

Add another CLI argument, --persist-peers.
If set, the node shall periodically save information about the currently connected (and recently connected) peers, and upon startup will prefer the persisted peers over the provided --bootnodes <NODES>.
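
As a minimal sketch, this is roughly what such a flag could look like with clap's derive API; the flag name and struct here are assumptions for illustration, not part of the existing sc-cli options, and the real wiring into Substrate's network configuration would differ.

```rust
use clap::Parser;

/// Hypothetical CLI fragment; Substrate's real CLI structs differ.
#[derive(Parser, Debug)]
struct NetworkParams {
    /// Persist known peers to disk and prefer them over --bootnodes on restart.
    /// (Assumed flag, not an existing sc-cli option.)
    #[arg(long)]
    persist_peers: bool,
}

fn main() {
    let params = NetworkParams::parse();
    if params.persist_peers {
        // In a real node this would enable the periodic peer-persistence task
        // described above.
        println!("peer persistence enabled");
    }
}
```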

In order to implement such behaviour, modifications to sc-peerset::Peerset/PeersetHandle and sc-network::DiscoveryBehaviour would be required. There will be two pieces of persistent data:

  • the PSM data;
  • the Peers' addresses.

Keep track of the PeerIds of the connected nodes.

The Peerset (sc-peerset) is responsible for initiating the connection to another peer: it allocates the available slots among the known peers, with a preference for the reserved nodes.

The state of the Peerset can be saved periodically and restored upon the node's startup.

The only thing to do in sc-peerset is to add the ability to dump the current PSM state.

The persistence routine is to be done within sc-network (probably sc-network::DiscoveryBehaviour); a sketch of what the persisted snapshot could look like is below.
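
As a rough illustration, here is a sketch of a serializable snapshot that a hypothetical Peerset::dump_state() could produce and that sc-network could write to disk periodically. The type name, fields, and serde/JSON approach are assumptions for this sketch, not the existing sc-peerset API; it simply mirrors the old nodes.json idea.

```rust
use std::{fs, io, path::Path};
use serde::{Deserialize, Serialize};

/// Hypothetical serializable snapshot of the peerset manager's state.
/// The field layout is an assumption for this sketch, not the real PSM internals.
#[derive(Serialize, Deserialize)]
struct PeersetSnapshot {
    /// Peers currently occupying slots, identified by their base58 peer IDs.
    connected: Vec<String>,
    /// Known-but-not-connected peers together with their tracked reputation.
    known: Vec<(String, i32)>,
}

impl PeersetSnapshot {
    /// Write the snapshot as JSON, mirroring the old nodes.json approach.
    fn save(&self, path: &Path) -> io::Result<()> {
        let json = serde_json::to_string_pretty(self)
            .map_err(|e| io::Error::new(io::ErrorKind::Other, e))?;
        fs::write(path, json)
    }

    /// Reload a previously saved snapshot at startup, if one exists and parses.
    fn load(path: &Path) -> io::Result<Option<Self>> {
        match fs::read_to_string(path) {
            Ok(json) => Ok(serde_json::from_str(&json).ok()),
            Err(e) if e.kind() == io::ErrorKind::NotFound => Ok(None),
            Err(e) => Err(e),
        }
    }
}

fn main() -> io::Result<()> {
    let snapshot = PeersetSnapshot {
        connected: vec!["12D3KooW...".to_string()],
        known: vec![("12D3KooX...".to_string(), 0)],
    };
    let path = Path::new("peerset_snapshot.json");
    snapshot.save(path)?;
    let restored = PeersetSnapshot::load(path)?;
    println!("restored {} connected peers", restored.map_or(0, |s| s.connected.len()));
    Ok(())
}
```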

Keep a "last-resort" cache of Multiaddrs.

The Peerset, however, operates in terms of PeerIds, which is not sufficient to rejoin the previously connected peers; the Multiaddrs of the connected peers also need to be persisted.
Those can be saved periodically, in an LRU fashion, by the DiscoveryBehaviour.

The "last-resort" part: normally the peers are resolved as they are now, but if <DiscoveryBehaviour as NetworkBehaviour>::addresses_of_peer(...) fails to resolve a peer ID into an address, this cache is used, hence "last-resort".

Something I am not sure how to implement

If the node has only inbound connections, it does not know any addresses worth saving.

Of course, if that node restarts, chances are that the previously connected peers will "drag" it back into the network using the previously known endpoint. But this isn't terribly robust: if the node changes address between restarts, or goes offline long enough to be "forgotten", it won't be able to rejoin the network.

Probably the solution to this problem is to eagerly look for connectable addresses so as to keep the Multiaddrs cache populated at all times.


bkchr commented Jul 25, 2022

Add another CLI argument, --persist-peers.
If set, the node shall periodically save information about the currently connected (and recently connected) peers, and upon startup will prefer the persisted peers over the provided --bootnodes <NODES>.

Why do we need a CLI option for this?

If the node has only inbound connections, it does not know any addresses worth saving.

How would you end up with only inbound connections? If you only have those, you already have some problems; for example, you would be running into an eclipse attack. You can just ignore this case.


tomaka commented Jul 25, 2022

The debug module contains the identity system, which asks the peers we are connected to what their addresses are.


koute commented Jul 26, 2022

I second what @bkchr said; it would make more sense to me to just have this behavior be the default instead of requiring the user to pass a CLI argument to enable it. (On the other hand, adding an option to disable it might make some sense, e.g. for debugging purposes.)


bkchr commented Aug 25, 2022

To keep track of what @RGafiyatullin and I discussed in DMs:

I think that, for now, we should persist all peers, with some kind of Peer { peer_id, addr, last_seen }, where last_seen is just the Unix timestamp at which we were last connected to this node. We could use this to prune all nodes that we weren't connected to for more than a week or so. Then I would just load the list on startup, take roughly half of the nodes with the most recent last_seen plus a random sampling of the other nodes, and pass these in as boot nodes to the sync peer set; a sketch of this selection is below.
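
A sketch of the prune-and-select logic described above. The Peer struct and the one-week cutoff follow the suggestion; pick_startup_peers and the use of the rand crate (0.8-style API) for the random sampling are assumptions about how the selection might be written, not existing Substrate code.

```rust
use rand::seq::SliceRandom;
use std::time::{Duration, SystemTime, UNIX_EPOCH};

/// Persisted record per peer, as suggested above.
#[derive(Clone)]
struct Peer {
    peer_id: String,
    addr: String,
    /// Unix timestamp of the last time we were connected to this peer.
    last_seen: u64,
}

/// Drop peers we have not been connected to for more than a week,
/// then pick half of the remainder by recency plus a random sample of the rest.
fn pick_startup_peers(mut peers: Vec<Peer>, want: usize) -> Vec<Peer> {
    let now = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .unwrap_or_default()
        .as_secs();
    let week = Duration::from_secs(7 * 24 * 60 * 60).as_secs();
    peers.retain(|p| now.saturating_sub(p.last_seen) <= week);

    // Most recently seen first.
    peers.sort_by(|a, b| b.last_seen.cmp(&a.last_seen));

    let recent_count = (want / 2).min(peers.len());
    let mut selected: Vec<Peer> = peers[..recent_count].to_vec();

    // Random sample of the remaining peers to fill up the rest of the slots.
    let mut rest: Vec<Peer> = peers[recent_count..].to_vec();
    rest.shuffle(&mut rand::thread_rng());
    selected.extend(rest.into_iter().take(want - recent_count));

    // These could then be fed to the sync peer set much like --bootnodes.
    selected
}

fn main() {
    let now = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .unwrap_or_default()
        .as_secs();
    let peers = vec![
        Peer { peer_id: "a".into(), addr: "/ip4/10.0.0.1/tcp/30333".into(), last_seen: now },
        Peer { peer_id: "b".into(), addr: "/ip4/10.0.0.2/tcp/30333".into(), last_seen: now - 3600 },
        Peer { peer_id: "c".into(), addr: "/ip4/10.0.0.3/tcp/30333".into(), last_seen: now - 30 * 24 * 3600 },
    ];
    // Peer "c" is pruned (older than a week); the rest compete for the slots.
    for p in pick_startup_peers(peers, 2) {
        println!("bootstrapping from {} at {}", p.peer_id, p.addr);
    }
}
```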

What I don't know, and what might also work: can we not just announce all available peers to the peer set and let it start connecting to some of them? I mean, when we currently connect to a boot node it shares known nodes with us (I don't know exactly how it works) and then we start connecting to these nodes as well. Could we not hook in there directly? So instead of getting these peers from a different node, we would insert the known peers ourselves.

One other random question: should we maybe not persist inbound peers? Otherwise this could lead to some eclipse attack, as after the next restart these nodes would be outbound peers from our POV.

Would be nice to get some feedback from you @tomaka.

nazar-pc commented

FWIW, we have implemented simple networking persistence in Subspace and decided that the first failure time is the best thing to store, so that we can eventually kick non-responsive peers.
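
A small sketch of that idea: record the first failure time per persisted peer and drop peers that have been failing for too long. The struct, field names, and the 24-hour cutoff are illustrative assumptions, not what Subspace actually implements.

```rust
use std::time::{Duration, SystemTime};

/// Persisted peer record that also remembers when dialing first started failing.
struct KnownPeer {
    addr: String,
    /// Set on the first failed dial attempt, cleared on any successful connection.
    first_failure: Option<SystemTime>,
}

impl KnownPeer {
    /// A peer is evicted once it has been failing continuously for longer than `max_failure_age`.
    fn should_evict(&self, now: SystemTime, max_failure_age: Duration) -> bool {
        match self.first_failure {
            Some(since) => now
                .duration_since(since)
                .map(|age| age > max_failure_age)
                .unwrap_or(false),
            None => false,
        }
    }
}

fn main() {
    let peer = KnownPeer {
        addr: "/ip4/10.0.0.4/tcp/30333".to_string(),
        first_failure: Some(SystemTime::now() - Duration::from_secs(2 * 24 * 60 * 60)),
    };
    // With a 24-hour cutoff this peer would be dropped from the persisted set.
    assert!(peer.should_evict(SystemTime::now(), Duration::from_secs(24 * 60 * 60)));
    println!("evict {}: yes", peer.addr);
}
```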


tomaka commented Aug 25, 2022

@bkchr It's actually insanely complicated. The whole story of how we store peers and addresses, for how long (if you store them forever, you've got a memory leak), which addresses we try, when we stop trying addresses, etc. is a big hack, because in 3 years of networking I've never figured out an algorithm to do this properly.


bkchr commented Aug 25, 2022

@bkchr It's actually insanely complicated. The whole story of how we store peers and addresses, for how long (if you store them forever, you've got a memory leak), which addresses we try, when we stop trying addresses, etc. is a big hack, because in 3 years of networking I've never figured out an algorithm to do this properly.

Yeah, but for now I'm more interested in getting an idea of the best place to hook this in. The exact pruning strategy etc. could be figured out over time.

altonen transferred this issue from paritytech/substrate Aug 24, 2023
the-right-joyce added the I5-enhancement and D1-medium labels and removed the J0-enhancement label Aug 25, 2023
claravanstaden pushed a commit to Snowfork/polkadot-sdk that referenced this issue Dec 8, 2023
helin6 pushed a commit to boolnetwork/polkadot-sdk that referenced this issue Feb 5, 2024
bkchr pushed a commit that referenced this issue Apr 10, 2024