-
Notifications
You must be signed in to change notification settings - Fork 690
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reactive syncing metrics #5410
Reactive syncing metrics #5410
Conversation
…task of chain sync and make them reactive
e8fde12
to
9135bb1
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, thank you! With updating metrics on every operation it's more difficult to check every update is handled. Hope you have searched through the code and made sure everything is covered.
if let Some(metrics) = &self.metrics { | ||
metrics.failed.set(self.failed_requests.len().try_into().unwrap_or(u64::MAX)); | ||
metrics.pending.inc(); | ||
} | ||
} else { | ||
trace!(target: LOG_TARGET, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is out of scope of this PR, but it looks like we can legitimately hit this if the response was delivered after the peer disconnected event arrived. This can happen depending on the order different protocols/streams are polled as these events are emitted by different objects (notifications protocol for sync peers and request-response protocol for responses).
Created an issue for this: #5414.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hope you have searched through the code and made sure everything is covered.
I checked all occurrences and invariants and I don't think I missed anything.
if let Some(request) = self.active_requests.remove(&who) { | ||
if let Some(metrics) = &self.metrics { | ||
metrics.active.dec(); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would grouping the request hashmaps (or vec) with metrics help here?
Some time in the future we might do a active_requests.remove()
and forget to decrement the appropriate metric.
One suggestion might be to group them into a wrapper:
struct ActiveRequestsMetered {
inner: HashMap<..>, // Current self.active_requests
metrics: Option<Metrics>,
}
Maybe an even simpler approach would be to introduce a fn active_requests_remove() { ...; metrics.active.dec(); }
.
Could also be a followup if would take too long to implement :D
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are several of these in the code and I'm not sure how much it would help with ergonomics to be completely honest. Metrics are one of those things that are just inherently invasive unfortunately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thanks for contributing!
…rics # Conflicts: # substrate/client/network/sync/src/strategy.rs
I resolved minor merge conflict, but also had to restore useless tick. Without this tick tests like polkadot-sdk/substrate/client/network/test/src/sync.rs Lines 1020 to 1024 in 9fecd89
If someone with more knowledge in this area can take a look at it that'd be great, but it shouldn't block this PR anymore. |
CI seems to be good now |
4057ccd
* master: (36 commits) Bump the ci_dependencies group across 1 directory with 2 updates (#5401) Remove deprecated calls in cumulus-parachain-system (#5439) Make the PR template a default for new PRs (#5462) Only log the propagating transactions when they are not empty (#5424) [CI] Fix SemVer check base commit (#5361) Sync status refactoring (#5450) Add build options to the srtool build step (#4956) `MaybeConsideration` extension trait for `Consideration` (#5384) Skip slot before creating inherent data providers during major sync (#5344) Add symlinks for code of conduct and contribution guidelines (#5447) pallet-collator-selection: correctly register weight in `new_session` (#5430) Derive `Clone` on `EncodableOpaqueLeaf` (#5442) Moving `Find FAIL-CI` check to GHA (#5377) Remove panic, as proof is invalid. (#5427) Reactive syncing metrics (#5410) [bridges] Prune messages from confirmation tx body, not from the on_idle (#5006) Change the chain to Rococo in the parachain template Zombienet config (#5279) Improve the appearance of crates on `crates.io` (#5243) Add initial version of `pallet_revive` (#5293) Update OpenZeppelin template documentation (#5398) ...
This PR untangles syncing metrics and makes them reactive, the way metrics are supposed to be in general.
Syncing metrics were bundled in a way that caused coupling across multiple layers: justifications metrics were defined and managed by
ChainSync
, but only updated periodically on tick inSyncingEngine
, while actual values were queried fromExtraRequests
. This convoluted architecture was hard to follow when I was looking into #5333.Now metrics that correspond to each component are owned by that component and updated as changes are made instead of on tick every 1100ms.
This does add some annoying boilerplate that is a bit harder to maintain, but it separates metrics more nicely and if someone queries them more frequently will give arbitrary resolution. Since metrics updates are just atomic operations I do not expect any performance impact of these changes.
Will add prdoc if changes look good otherwise.
P.S. I noticed that importing requests (and corresponding metrics) were not cleared ever since corresponding code was introduced in dc41558#r145518721 and I left it as is to not change the behavior, but it might be something worth fixing.
cc @dmitry-markin