Syncing strategy refactoring (part 3) #5737

Merged

Conversation

nazar-pc
Contributor

@nazar-pc nazar-pc commented Sep 17, 2024

Description

This is a continuation of #5666 that finally fixes #5333.

This should allow developers to create custom syncing strategies, or even a whole syncing engine if they so desire. It also moves syncing engine creation and the addition of the corresponding protocol outside of the build_network_advanced method, which is something Bastian expressed as desired in #5 (comment).

Here I replaced strategy-specific types and methods in the SyncingStrategy trait with generic ones. Specifically, SyncingAction is now used by all strategies instead of strategy-specific types with conversions. StrategyKey was an enum with a fixed set of options and is now replaced with an opaque type that strategies create privately and send to upper layers. Requests and responses are now handled in a generic way regardless of the strategy, which reduced and simplified the strategy API.
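To illustrate the idea of an opaque key plus a single shared action type, here is a minimal sketch. It is not the code from this PR: the variants of SyncingAction, the my_strategy module and the helper functions are hypothetical simplifications, and only the names StrategyKey and SyncingAction come from the description above.

```rust
/// Simplified stand-in for the opaque key: the inner value is private, so
/// upper layers can only copy, compare and hash it.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct StrategyKey(&'static str);

impl StrategyKey {
    /// Strategies call this privately to mint their own key.
    pub const fn new(name: &'static str) -> Self {
        Self(name)
    }
}

/// Simplified stand-in for the shared action type (variants are illustrative).
#[derive(Debug, PartialEq, Eq)]
pub enum SyncingAction {
    /// Ask the upper layer to send an already-encoded request for `key`.
    StartRequest { key: StrategyKey, payload: Vec<u8> },
    /// The strategy has nothing more to do.
    Finished,
}

/// Hypothetical custom strategy living in its own module.
pub mod my_strategy {
    use super::{StrategyKey, SyncingAction};

    // The key constant is private to the strategy module.
    const KEY: StrategyKey = StrategyKey::new("MyStrategy");

    /// Produces the next action, tagged with this strategy's key.
    pub fn next_action() -> SyncingAction {
        SyncingAction::StartRequest { key: KEY, payload: vec![1, 2, 3] }
    }

    /// Lets the strategy recognise responses routed back with its own key.
    pub fn owns(key: StrategyKey) -> bool {
        key == KEY
    }
}
```

The upper layer only stores and forwards the key, so adding a new strategy does not require touching a central enum.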

PolkadotSyncingStrategy now lives in its dedicated module (had to edit .gitignore for this) like other strategies.

build_network_advanced takes a generic SyncingService as an argument alongside a few other low-level types (that can probably be extracted in the future as well), without any notion of how syncing is actually done. All the protocols and tasks are created outside and are no longer part of the network. It still adds a bunch of protocols, such as the light client one and some others; these should eventually be restructured so that build_network_advanced just builds a generic network and does not handle application-specific protocols.

Integration

Just like #5666 introduced build_polkadot_syncing_strategy, this PR introduces build_default_block_downloader. For convenience and to avoid typical boilerplate, a simpler high-level function build_default_syncing_engine is also added; it takes care of creating the typical block downloader, syncing strategy and syncing engine, and is what most users will use going forward. Towards the end of the PR, build_network was renamed to build_network_advanced and build_network's API was reverted to its pre-#5666 form, so most users will not see much of a difference when upgrading unless they opt in to the new API.

Review Notes

For StrategyKey I was thinking about using something like a private type and storing its TypeId inside instead of a static string; let me know if that would be preferred.
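The TypeId alternative mentioned here could look roughly like the sketch below. This is an assumption about the shape of such a design, not code from the PR; the marker types and the two key functions are hypothetical.

```rust
use std::any::TypeId;

/// Variant of the opaque key that stores a `TypeId` instead of a string.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct StrategyKey(TypeId);

impl StrategyKey {
    /// Derives the key from a marker type rather than a string literal.
    pub fn of<Marker: 'static>() -> Self {
        Self(TypeId::of::<Marker>())
    }
}

// Hypothetical marker types; each strategy would keep its own one private.
struct WarpSyncMarker;
struct ChainSyncMarker;

/// Key minted by a hypothetical warp sync strategy.
pub fn warp_sync_key() -> StrategyKey {
    StrategyKey::of::<WarpSyncMarker>()
}

/// Key minted by a hypothetical chain sync strategy.
pub fn chain_sync_key() -> StrategyKey {
    StrategyKey::of::<ChainSyncMarker>()
}
```

Compared with a static string, two strategies cannot collide by accidentally picking the same name, since each key is tied to a type that only its strategy can name.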

The biggest change is to the requests that different strategies make and how their responses are handled. The most annoying thing here is that block response decoding, in contrast to all other responses, depends on the request. This meant the request had to be passed throughout the system. While originally Response was Vec<u8>, I didn't want to re-encode/decode the request and response just to fit into that API, so I ended up with Box<dyn Any + Send>. This allows responses to be truly generic, and each strategy knows how to downcast a response back to the concrete type when handling it.
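A minimal sketch of the type-erased response flow, assuming hypothetical BlockRequest and BlockResponse types (the real request and response types in the PR are different and more involved). It only shows the standard library mechanics of boxing a value as Box<dyn Any + Send> and downcasting it back.

```rust
use std::any::Any;

/// Hypothetical request; kept next to the response because decoding needs it.
#[derive(Debug, PartialEq)]
pub struct BlockRequest {
    pub from: u64,
    pub max: u32,
}

/// Hypothetical response that carries its originating request along.
#[derive(Debug, PartialEq)]
pub struct BlockResponse {
    pub request: BlockRequest,
    pub raw: Vec<u8>,
}

/// Type-erased response as it travels through the generic layers.
pub type AnyResponse = Box<dyn Any + Send>;

/// The request-sending side erases the concrete type.
pub fn make_response(request: BlockRequest, raw: Vec<u8>) -> AnyResponse {
    Box::new(BlockResponse { request, raw })
}

/// The strategy downcasts back to the type it knows it asked for.
pub fn handle_response(response: AnyResponse) -> Result<BlockResponse, &'static str> {
    response
        .downcast::<BlockResponse>()
        .map(|boxed| *boxed)
        .map_err(|_| "unexpected response type")
}
```

A failed downcast hands the original box back in the error position, so a strategy can report or ignore responses that were not meant for it without panicking.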

Import queue refactoring was needed to move SyncingEngine construction out of build_network that awkwardly implemented for SyncingService, but due to &mut self wasn't usable on Arc<SyncingService> for no good reason. Arc<SyncingService> itself is of course useless, but refactoring to replace it with just SyncingService was unfortunately rejected in #5454
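The &mut self versus Arc friction can be reproduced with the standard library alone. The sketch below uses a hypothetical Service type with a counter and is not the SyncingService or import queue code from the PR: a method taking &mut self cannot be reached through an Arc once it has more than one owner, while a &self method with interior mutability can.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Hypothetical service with a single counter.
#[derive(Default)]
pub struct Service {
    processed: AtomicUsize,
}

impl Service {
    /// `&mut self` flavour: needs exclusive access to the value.
    pub fn notify_exclusive(&mut self, count: usize) {
        *self.processed.get_mut() += count;
    }

    /// `&self` flavour: works through any shared handle thanks to the atomic.
    pub fn notify_shared(&self, count: usize) {
        self.processed.fetch_add(count, Ordering::Relaxed);
    }

    /// Current value of the counter.
    pub fn processed(&self) -> usize {
        self.processed.load(Ordering::Relaxed)
    }
}

/// Returns `false` when the `Arc` has other owners and `&mut` is unobtainable.
pub fn try_notify_exclusive(service: &mut Arc<Service>, count: usize) -> bool {
    match Arc::get_mut(service) {
        Some(inner) => {
            inner.notify_exclusive(count);
            true
        }
        None => false,
    }
}
```

As soon as a second clone of the Arc exists, try_notify_exclusive returns false, whereas notify_shared keeps working on every clone.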

As usual, I recommend reviewing this PR as a series of commits instead of as the final diff; it'll make more sense that way.

Checklist

  • My PR includes a detailed description as outlined in the "Description" and its two subsections above.
  • My PR follows the labeling requirements of this project (at minimum one label for T required)
    • External contributors: ask maintainers to put the right label on your PR.
  • I have made corresponding changes to the documentation (if applicable)

@nazar-pc
Contributor Author

@dmitry-markin @lexnv this is probably the last one in a series, I'm reasonably happy with the achieved result and will try to rebase our downstream changes on top of these new APIs without modifying Substrate itself.

@nazar-pc
Contributor Author

nazar-pc commented Sep 17, 2024

@ggwpez you commented on CI stuff before, so maybe you can help. Apparently the Rust version used for formatting has changed, but since it just says nightly in the workflow and I find exact version used in .github/workflows/reusable-preflight.yml, I no longer know which version I am even supposed to use to do the formatting locally. I used nightly-2024-04-10 before, but apparently it is the wrong version now.

Request: can preflight workflow print the versions maybe for reference? It prints some stuff, but nothing actually useful IMO.

UPD: I was applying formatting in a different subdirectory, but the request is still relevant.

@nazar-pc nazar-pc force-pushed the syncing-strategy-refactoring-part-3 branch 2 times, most recently from 938b988 to 6012b67 Compare September 17, 2024 13:32
@nazar-pc nazar-pc force-pushed the syncing-strategy-refactoring-part-3 branch from 6012b67 to 53b34d0 Compare September 17, 2024 14:45
@dmitry-markin dmitry-markin added the T0-node This PR/Issue is related to the topic “node”. label Sep 17, 2024
@nazar-pc
Contributor Author

Addressed Bastian's comment from #5666 and resolved conflicts with master.

@ggwpez
Member

ggwpez commented Sep 20, 2024

Yea am very much in favour of printing all versions at the beginning of CI jobs, (Rust, rustup, cargo, clippy, fmt, nextest) to easily debug stuff.
I can add some prints myself when i see it the next time, otherwise i think @AndWeHaveAPlan was working on this.

@ggwpez
Member

ggwpez commented Sep 20, 2024

bot fmt

@command-bot

command-bot bot commented Sep 20, 2024

@ggwpez https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7399653 was started for your command "$PIPELINE_SCRIPTS_DIR/commands/fmt/fmt.sh". Check out https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/pipelines?page=1&scope=all&username=group_605_bot to know what else is being executed currently.

Comment bot cancel 6-6aed9a84-d20a-4977-aa73-1e1cdd7a7e1c to cancel this command or bot cancel to cancel all commands in this pull request.

@nazar-pc
Contributor Author

It is already all formatted properly, just took a few attempts 🙂

@command-bot

command-bot bot commented Sep 20, 2024

@ggwpez Command "$PIPELINE_SCRIPTS_DIR/commands/fmt/fmt.sh" has finished. Result: https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7399653 has finished. If any artifacts were generated, you can download them from https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7399653/artifacts/download.

@nazar-pc
Contributor Author

Friendly ping here, would be nice to merge this before merge conflicts happen. It shouldn't be as scary as the diff might suggest.

nazar-pc added a commit to autonomys/polkadot-sdk that referenced this pull request Sep 27, 2024
Contributor

@dmitry-markin dmitry-markin left a comment

The changes look good and indeed finally allow custom syncing strategies to be used with Substrate. This is what was originally planned when separate syncing strategies were introduced. Thank you for implementing this!

templates/solochain/node/Cargo.toml (outdated, resolved)
}
if remove_obsolete {
Contributor

Should we remove pending responses first, then check if our peers contain the peer_id?

I think we were removing the pending_responses before send_block_request checked if this is a known peer

Contributor Author

Above branch is never ever supposed to happen (see debug_assert!(false)), so I don't think it makes any difference either way whether we remove pending responses before crashing or not.

Also removed redundant comment below in latest push.

self.protocol_name.clone(),
request.encode_to_vec(),
tx,
IfDisconnected::ImmediateError,
Contributor

dq: Should we try to establish connection instead of immediate errors?

Would you think it is beneficial for strategies to provide this option in their actions?

Contributor Author

This is the same as it was before, so I didn't change the behavior here and so far there was no need to do this.

Maybe the assumption is that we only send requests to connected peers, so if someone disconnected we just error instead of attempting to dial them?
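The two policies under discussion can be pictured with a small self-contained model. The IfDisconnected enum below is a local stand-in that mirrors the variant name seen in the diff (TryConnect is assumed to be the alternative), and Outcome and send_request are hypothetical; none of this is the actual networking code.

```rust
/// Local stand-in for the disconnect policy passed with a request.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum IfDisconnected {
    /// Try to dial the peer before giving up.
    TryConnect,
    /// Fail straight away when the peer is not connected.
    ImmediateError,
}

/// Hypothetical outcome of attempting to send a request.
#[derive(Debug, PartialEq, Eq)]
pub enum Outcome {
    Sent,
    Dialing,
    NotConnected,
}

/// Models how the policy only matters when the peer is not connected.
pub fn send_request(connected: bool, policy: IfDisconnected) -> Outcome {
    match (connected, policy) {
        (true, _) => Outcome::Sent,
        (false, IfDisconnected::TryConnect) => Outcome::Dialing,
        (false, IfDisconnected::ImmediateError) => Outcome::NotConnected,
    }
}
```

If strategies only ever target peers they already track as connected, the two policies differ solely in the race where a peer drops between selection and sending, which is consistent with keeping the existing behavior.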

Contributor

@lexnv lexnv left a comment

LGTM! Thanks for contributing!

Contributor Author

@nazar-pc nazar-pc left a comment

Also resolved a minor merge conflict in imports after #5686

@nazar-pc
Contributor Author

nazar-pc commented Nov 5, 2024

Merged master and CI should pass now, let me know if there is anything else I can do to push this across the finish line

@dmitry-markin
Contributor

I will run some final tests, and if everything is fine, we can merge the PR.

@nazar-pc
Contributor Author

nazar-pc commented Nov 5, 2024

We've been running a fork with these changes for over a month now without issues.

@EgorPopelyaev
Contributor

@dmitry-markin How does this PR look, is it fine to be merged?

@dmitry-markin
Contributor

dmitry-markin commented Nov 7, 2024

> @dmitry-markin How does this PR look, is it fine to be merged?

I am testing it on Versi to make sure everything is good. Code-wise it is fine. In any case, let's stabilize it in master first skipping stable2412.

@nazar-pc you are using a fork anyway and do not need it in the polkadot-sdk release, do you?

@nazar-pc
Contributor Author

nazar-pc commented Nov 7, 2024

I was really, really trying to upstream our patches wherever possible; it is kind of a pain to rebase everything on each release. I was hoping that opening the PR in September would get it into the December build, and I'd be disappointed if it didn't. There isn't really anything crazy in this PR, mostly just moving things around. I'd say the risk of regressions is low, and any regression would be evident immediately.

We just launched Mainnet yesterday (already 1239 consensus nodes) with this included after more than a month of testnets. There have been zero complaints so far about anything potentially related to this, for what it is worth.

@dmitry-markin
Contributor

Then I got it wrong, sorry @nazar-pc. Let's see what we can do regarding the December release.

@dmitry-markin dmitry-markin added this pull request to the merge queue Nov 7, 2024
Merged via the queue into paritytech:master with commit 12d9052 Nov 7, 2024
194 of 196 checks passed
@nazar-pc nazar-pc deleted the syncing-strategy-refactoring-part-3 branch November 7, 2024 11:04
@nazar-pc
Contributor Author

nazar-pc commented Nov 7, 2024

Thanks everyone!

Labels
T0-node This PR/Issue is related to the topic “node”.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Make SyncingStrategy abstract and allow developers to customize it
5 participants