Skip to content
This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

Extracting the syncing protocol from sc-network #8686

Closed
tomaka opened this issue Apr 28, 2021 · 36 comments · Fixed by #12828
Closed

Extracting the syncing protocol from sc-network #8686

tomaka opened this issue Apr 28, 2021 · 36 comments · Fixed by #12828
Assignees
Labels
I7-refactor Code needs refactoring. Z3-substantial Can be fixed by an experienced coder with a working knowledge of the codebase.

Comments

@tomaka
Copy link
Contributor

tomaka commented Apr 28, 2021

At the moment, sc-network contains a pretty generic low-level networking stack, on top of which the syncing and transaction protocols are implemented.
sc-network exposes both the API of the low-level networking stack, but also an API for syncing and transaction-related actions.

I think it would make sense to move the syncing and transaction out of sc-network.

I focused on the syncing in the issue title, because transactions are already more or less segregated from the rest.

@tomaka tomaka added I7-refactor Code needs refactoring. Z3-substantial Can be fixed by an experienced coder with a working knowledge of the codebase. labels Apr 28, 2021
@stale
Copy link

stale bot commented Jul 7, 2021

Hey, is anyone still working on this? Due to the inactivity this issue has been automatically marked as stale. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the A5-stale Pull request did not receive any updates in a long time. No review needed at this stage. Close it. label Jul 7, 2021
@tomaka
Copy link
Contributor Author

tomaka commented Jul 8, 2021

Issue still relevant and important.

@stale stale bot removed the A5-stale Pull request did not receive any updates in a long time. No review needed at this stage. Close it. label Jul 8, 2021
@bkchr
Copy link
Member

bkchr commented Nov 15, 2021

I just briefly discussed this with @tomaka. Sync is rather tightly integrated with the networking currently. This manifests mainly in the sync peer-set being explicitly used by grandpa to keep both in sync. For now this should stay this way, this means we keep a hard coded sync peer-set inside the networking crate. For now, we would only move out the sync related code into its own crate.

@bkchr
Copy link
Member

bkchr commented Nov 15, 2021

@wigy-opensource-developer
Copy link
Contributor

Okay, I will try to extract anything related to client/network/src/schema/api.v1.proto into a new crate sc-network-sync under client/network-sync

@bkchr
Copy link
Member

bkchr commented Dec 6, 2021

FYI, it is mainly this https://github.com/paritytech/substrate/blob/master/client/network/src/protocol/sync.rs and this https://github.com/paritytech/substrate/tree/master/client/network/src/protocol/sync.

And while I see that network-gossip is on the client folder level, sync should go to client/network/sync.

@tomaka
Copy link
Contributor Author

tomaka commented Dec 6, 2021

What is complicated is that the code in protocol.rs works as follow:

  • The low-level code reports that a substream has been opened.
  • We synchronously check whether this peer's handshake is ok.
  • If no, we ask the low-level code to close the substream and pretend that this never happened.
  • If yes, we generate a SyncConnected event in the public high-level API.

If you want to plug the sync code through the public high-level API, you need to refactor this, but I don't really know how.

@nazar-pc
Copy link
Contributor

Turns out I need this for changing sync mechanics.
Can anyone share priority/ETA on this so I can manage expectations better?

@wigy-opensource-developer
Copy link
Contributor

wigy-opensource-developer commented Feb 19, 2022

I am learning a lot about the Substrate libp2p integration as I slowly crawl with this PR. I wanted to do too much refactoring at first, but got some intermediate milestone I can reach in a week next to my other ongoing tasks, I guess.

@bkchr
Copy link
Member

bkchr commented Feb 19, 2022

@wigy-opensource-developer can you please give details on what is planned.

@wigy-opensource-developer
Copy link
Contributor

Of course.

As a first step sc-network will depend on an extracted crate sc-network-sync that will contain all handling of the block and state synchronization messages defined in the api.v1 Protobuf protocol.

As a second step message handling for light.v1 will go into an sc-network-light crate, but it will be still sc-network that depends on both.

As a third step, abstractions common to sc-network-sync and sc-network-light will be introduced into sc-network and the direction of dependency will be reversed.

My estimation is that the first step will get to master next week.

@nazar-pc
Copy link
Contributor

Hey, any progress here?

@bkchr
Copy link
Member

bkchr commented Mar 31, 2022

Hey, no there is not yet any real progress. This will also still take quite some time to have something proper.

@nazar-pc
Copy link
Contributor

Is there something I can do to speed up the process here? I can do some refactoring if I understand what it should look like in the end.

@bkchr
Copy link
Member

bkchr commented Mar 31, 2022

Can you again say what you need @nazar-pc?

@nazar-pc
Copy link
Contributor

To replace sync mechanism with a custom one. Let's say node discovers it is significantly behind the tip of the chain and needs to reach out to archival node to get last 1000 blocks. I need that to happen over our distributed storage network instead, archival nodes in traditional sense will not be feasible on our network due to scale.

@bkchr
Copy link
Member

bkchr commented Mar 31, 2022

Do you have any write up of your syncing algorithm? I mean it is enough to have some rough write up on how it works.

@nazar-pc
Copy link
Contributor

There is a lot to it, but somewhat abstractly the algorithm would be like this:

  • for older archival history:
    • identify the tip of the chain and root of the archived history (used to verify individual pieces)
    • start pulling segments of archived history corresponding to desired blocks in parallel from different peers (each segment is 128 pieces erasure coded into 256)
    • verify witness embedded in each of those pieces against root of the history
    • repair source data in case parity pieces were used
    • feed source pieces into reconstructor that will spit out encoded blocks as vectors of bytes alongside with corresponding block numbers (reverse of archiving process encoded blocks are fed into archiver and pieces are returned back)
    • at this point we have blocks we needed and can verify and import them
  • most recent blocks that were not archived yet (archiving is done over blocks at certain confirmation depth) should be synced in normal Substrate way from other full nodes

This replaces how node requests ranges of blocks of archival history from other nodes, I expect that majority of the remaining logic will remain the same as it is in Substrate though.

@bkchr
Copy link
Member

bkchr commented Apr 3, 2022

Okay ty.

@nazar-pc
Copy link
Contributor

I need to expand that syncing Subspace from Subspace DSN using described approach is the first step. We'd like to enable other chains to be able to sync from Subspace DSN in a similar way, so ideally we'd write some kind module that others can import into their project if they so choose. We were discussing about enabling parachains on Rococo to sync this way for instance.

@bkchr
Copy link
Member

bkchr commented Apr 20, 2022

As a first step sc-network will depend on an extracted crate sc-network-sync that will contain all handling of the block and state synchronization messages defined in the api.v1 Protobuf protocol.

@nazar-pc if you like you could start working on this. This is just the first iteration and is rather simple. It is really just about moving the code related to syncing to a new crate. sc-network for now will depend on this crate.

@nazar-pc
Copy link
Contributor

Cool, I'll notify once I start actively working on it

@nazar-pc
Copy link
Contributor

nazar-pc commented May 12, 2022

Part 4 is ready on my side, it makes sc-network, sc-network-light and sc-network-sync depend on sc-network-common, but not on each other. Everything is wired together at sc-service level and sc-network-sync/sc-network-light theoretically can be swapped/customized: #11412
Not 100% sure yet, but it might be just enough for me to do some of the tweaks to the sync process I wanted to do for a while.

Will submit as a PR once third PR is merged.

@bkchr
Copy link
Member

bkchr commented May 14, 2022

Not 100% sure yet, but it might be just enough for me to do some of the tweaks to the sync process I wanted to do for a while.

Sounds good :)

@Yithis
Copy link
Contributor

Yithis commented May 25, 2022

Hi! @nazar-pc do you plan more PR after the Network sync refactoring (part 4) #11412 is merged? All four PRs are a great work and are a major step towards untangling sync from the network. During our (@timorl and myself) investigation of the network, however, we spotted that there are still some parts of sync in the sc-network. Some examples from client/network/src/protocol.rs, from where many methods are forwarded to either ChainSync or behaviour/Notifications:
new_best_block_imported,
on_sync_peer_disconnected,
on_block_response,
on_state_response,
on_warp_sync_response,
on_sync_peer_connected,
announce_block,
push_block_announce_validation.
The list isn't complete, but collects the great part of functions that can't be untangled with sync /network only by basic level refactor. The question is if those are elements you are planning to change in the nearest future? You’ve stated that part 4 probably suffices for your tweaks so the question is for @bkchr and @tomaka if you are open for more PRs that move those remaining parts out and make the network more generic in terms of compatibility with exchangeable synchronization mechanism?

@nazar-pc
Copy link
Contributor

nazar-pc commented May 25, 2022

Yes, I do expect more PRs after that, just need feedback and to sync with maintainers before moving forward.

I'd like to not only sync to be decoupled, but eventually to have networking interface where sc-network would be just one possible implementation with ability to wrap/swap it with a custom one. So there would be a set of traits where some of the methods you mentioned would go for instance.

Just trying to make small steps without breaking things in the process.

UPD: Part 4 covers some of the needs I had, but likely not all of them, so expect more PRs for sure.

@nazar-pc
Copy link
Contributor

With latest PR still not merged I already started working on more changes. First of all addressing some of review comments from that PR that'll land separately, but also on refactorings around NetworkService.

On NetworkService specifically I noticed that there were traits here and there that were proxying calls into NetworkService, so I decided to move them into sc-network-common as well as extracting more traits for different use cases so that most of Substrate would depend on generic traits from sc-network-common instead of NetworkService explicitly.

It'll allow wrapping NetworkService or even wrapping default Substrate networking layer with an alternative just like you can swap consesnsus and other parts. It'll take time to make those interfaces truly generic (making it independent from libp2p is out of scope though IMO, so things like PeerId will remain in many places).

@bkchr
Copy link
Member

bkchr commented Aug 22, 2022

@nazar-pc what is your future plans here? Do you still want to continue pursuing this issue or can I assign it to someone else?

@nazar-pc
Copy link
Contributor

I was looking into sync algorithm specifically a few times over last few weeks and before that. There are still many bits of sync left in sc-network. I was thinking about moving those into some kind of standalone worker that will be driven independently from the network as such, but interact with it when necessary, but I don't have a specific plan there yet, just a rough idea.

I'm still very much interested in pursuing this, just depends on how much time I'll have.

@bkchr
Copy link
Member

bkchr commented Aug 22, 2022

Okay :) Yeah, sync is still really deeply backed into sc-network and this assumptions needs to be removed. In the end sync should be its own protocol that just sits on top of sc-network like Polkadot networking for example.

@the-right-joyce the-right-joyce moved this to Backlog 🗒 in Networking Aug 25, 2022
@the-right-joyce the-right-joyce removed this from the Network protocols moved out milestone Sep 1, 2022
@altonen
Copy link
Contributor

altonen commented Sep 15, 2022

@nazar-pc I've also started to slowly work on this. So far I'm just getting familiar with the code and seeing how deep it goes but if you have any pending changes you've made since #11940 that you feel like sharing, that would be much appreciated so we don't do the same work twice.

@nazar-pc
Copy link
Contributor

I don't have anything refactoring-wise pending.

@altonen altonen self-assigned this Sep 26, 2022
@altonen
Copy link
Contributor

altonen commented Oct 4, 2022

I have removed rest of the syncing code from sc-network, demo/PoC is here. It's not meant to be reviewed as-is and acts more as a reference because it's a) too big for a single PR b) has some solutions that are still a little rough around the edges and c) and is lacking tests. I'll be working next towards splitting it into smaller and more reviewable PRs.

Unfortunately, as we want to keep the hardcoded entry in the peerset, I've had to add some workarounds around the sc-network code to make it work with the now "transparent" syncing protocol and the calls it makes. I wish to remove these in the future if we can remove the hardcoded entry from the peerset. The testing framework is also going through some async growing pains which means that some tests in finality-grandpa must be rewritten to be fully async or the testing framework (e.g., for adding blocks for peers) must be internally async and then just expose sync interface but that may have unintended consequences. I'll get back to dealing with that issue when the time comes.

Currently the syncing code is installed as a bunch of generic request-response protocols (+ block announcement notification protocol) and the NetworkPeers is amended to contain report_peer_validation_result(PeerId, Enum::{Accept, Reject}) which the syncing code calls when it has determined whether Substrate should keep the connection open to this peer. If the peer is accepted by the syncing code, only then does sc-network emit SyncConnected event. This can be made more generic in the future by providing some mechanism for protocols to inform sc-network that they would want to validate incoming peers and sc-network would then wait for validation results from all of these protocols before emitting SyncConnected event. I would prefer something like this to what is currently implemented but I'm not sure if there is a need for such a mechanism, maybe in the future. I think it may be worth trying to fix paritytech/polkadot-sdk#556 while working with the handshaking/peer connection code.

NetworkNotifications is also modified (temporarily) to contain write_sync_notification() which uses the hardcoded peerset ID to send syncing-related notifications to peers. I don't like it at all but as long as we have hardcoded entries inside sc-network, we need to have some mechanism to use that entry and this is probably the path with least friction. I hope to remove it in the near future.

This was more of a status update. I've started to split the work and I hope to have something reviewable soonish. Let me know what you think about all of the above. I would especially appreciate any feedback on what needs to be done so that the hardcoded entry can be removed from the peerset.

@altonen altonen moved this from Backlog 🗒 to In Progress 🛠 in SDK Node Oct 5, 2022
@bkchr
Copy link
Member

bkchr commented Oct 11, 2022

only then does sc-network emit SyncConnected event.

This should entirely go out of sc-network. I think what we should do is the possibility to register in the sync protocol callbacks aka channels that get informed when sync adds or removes a peer. Then we could use this to make any protocol that follows the same peers as sync to use these callbacks. sc-network should really not assume anything about the protocols running on top of it.

@altonen altonen linked a pull request Dec 2, 2022 that will close this issue
@altonen altonen moved this from In Progress 🛠 to Code in review 🧐 in SDK Node Dec 2, 2022
@the-right-joyce the-right-joyce moved this from Backlog 🗒 to Code in review 🧐 in Networking Dec 7, 2022
@github-project-automation github-project-automation bot moved this from Code in review 🧐 to Blocked ⛔️ in Networking Mar 6, 2023
@github-project-automation github-project-automation bot moved this from Code in review 🧐 to Done ✅ in SDK Node Mar 6, 2023
@altonen altonen moved this from Blocked ⛔️ to Done ✅ in Networking Mar 6, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
I7-refactor Code needs refactoring. Z3-substantial Can be fixed by an experienced coder with a working knowledge of the codebase.
Projects
Status: Done
Status: done
8 participants