
Reactive syncing metrics #5410

Merged
9 commits merged into paritytech:master from reactive-sync-metrics on Aug 23, 2024

Conversation

nazar-pc (Contributor)

This PR untangles syncing metrics and makes them reactive, the way metrics are supposed to be in general.

Syncing metrics were bundled in a way that caused coupling across multiple layers: justifications metrics were defined and managed by ChainSync, but only updated periodically on tick in SyncingEngine, while actual values were queried from ExtraRequests. This convoluted architecture was hard to follow when I was looking into #5333.

Now the metrics that correspond to each component are owned by that component and updated as changes are made, instead of on a tick every 1100 ms.

This does add some boilerplate that is a bit harder to maintain, but it separates metrics more cleanly, and anyone querying them more frequently will get arbitrarily fine resolution. Since metric updates are just atomic operations, I do not expect any performance impact from these changes.
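As an illustration of the reactive pattern described above (a minimal sketch with illustrative names, using the raw prometheus crate directly rather than the code from this PR):

use std::collections::HashMap;

use prometheus::IntGauge;

// Illustrative metrics owned by the component whose state they describe.
struct Metrics {
    pending: IntGauge,
}

struct PendingRequests {
    requests: HashMap<u64, Vec<u8>>,
    // Metrics are optional: a node may run without a Prometheus registry.
    metrics: Option<Metrics>,
}

impl PendingRequests {
    // The gauge is updated at the mutation site (an atomic store),
    // not recomputed by another component on a periodic tick.
    fn insert(&mut self, id: u64, payload: Vec<u8>) {
        self.requests.insert(id, payload);
        if let Some(metrics) = &self.metrics {
            metrics.pending.set(self.requests.len().try_into().unwrap_or(i64::MAX));
        }
    }

    fn remove(&mut self, id: &u64) -> Option<Vec<u8>> {
        let removed = self.requests.remove(id);
        if removed.is_some() {
            if let Some(metrics) = &self.metrics {
                metrics.pending.dec();
            }
        }
        removed
    }
}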

Will add a prdoc if the changes otherwise look good.

P.S. I noticed that importing requests (and the corresponding metrics) have never been cleared since the corresponding code was introduced in dc41558#r145518721. I left this as is to avoid changing behavior, but it might be worth fixing.

cc @dmitry-markin

@dmitry-markin dmitry-markin (Contributor) left a comment

Looks good, thank you! With metrics updated on every operation, it's more difficult to check that every update is handled. Hope you have searched through the code and made sure everything is covered.

    if let Some(metrics) = &self.metrics {
        metrics.failed.set(self.failed_requests.len().try_into().unwrap_or(u64::MAX));
        metrics.pending.inc();
    }
} else {
    trace!(target: LOG_TARGET,
@dmitry-markin dmitry-markin (Contributor) commented on Aug 20, 2024

This is out of scope for this PR, but it looks like we can legitimately hit this if the response was delivered after the peer-disconnected event arrived. This can happen depending on the order in which different protocols/streams are polled, as these events are emitted by different objects (the notifications protocol for sync peers and the request-response protocol for responses).

Created an issue for this: #5414.
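For illustration only (hypothetical stand-in types, not code from this PR), a handler that tolerates this race treats a missing entry as expected rather than as a logic error:

use std::collections::HashMap;

// Hypothetical stand-ins for the real networking types.
type PeerId = u64;
struct Response;

struct PendingResponses {
    active_requests: HashMap<PeerId, ()>,
}

impl PendingResponses {
    fn on_response(&mut self, peer: PeerId, _response: Response) {
        // The disconnect notification and the response are emitted by
        // different protocols and polled independently, so the peer may
        // already have been removed when the response arrives.
        if self.active_requests.remove(&peer).is_none() {
            // Expected under the race described above: drop the late response.
            eprintln!("response from already-disconnected peer {peer}");
        }
    }
}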

@nazar-pc nazar-pc (Contributor, Author) left a comment

> Hope you have searched through the code and made sure everything is covered.

I checked all occurrences and invariants and I don't think I missed anything.

@dmitry-markin dmitry-markin added the T0-node label (“This PR/Issue is related to the topic ‘node’.”) on Aug 20, 2024
Comment on lines 189 to +192
if let Some(request) = self.active_requests.remove(&who) {
    if let Some(metrics) = &self.metrics {
        metrics.active.dec();
    }
Contributor

Would grouping the request hashmaps (or vecs) with the metrics help here?
Some time in the future we might call active_requests.remove() and forget to decrement the appropriate metric.

One suggestion might be to group them into a wrapper:

struct ActiveRequestsMetered {
    inner: HashMap<..>, // Current self.active_requests
    metrics: Option<Metrics>,
}

Maybe an even simpler approach would be to introduce a fn active_requests_remove() { ...; metrics.active.dec(); } helper.
This could also be a follow-up if it would take too long to implement :D
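A fleshed-out sketch of that wrapper idea (hypothetical, not what the PR adopted; the gauge uses the raw prometheus crate):

use std::{collections::HashMap, hash::Hash};

use prometheus::IntGauge;

struct ActiveRequestsMetered<K: Hash + Eq, V> {
    inner: HashMap<K, V>,
    // Optional gauge tracking the number of active requests.
    active: Option<IntGauge>,
}

impl<K: Hash + Eq, V> ActiveRequestsMetered<K, V> {
    fn insert(&mut self, key: K, value: V) -> Option<V> {
        let previous = self.inner.insert(key, value);
        if previous.is_none() {
            if let Some(gauge) = &self.active {
                gauge.inc();
            }
        }
        previous
    }

    // Routing removal through the wrapper makes it impossible to forget
    // the matching dec() at a call site.
    fn remove(&mut self, key: &K) -> Option<V> {
        let removed = self.inner.remove(key);
        if removed.is_some() {
            if let Some(gauge) = &self.active {
                gauge.dec();
            }
        }
        removed
    }
}

Every mutation path then updates the gauge in exactly one place, so the map and the metric cannot drift apart.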

Contributor Author

There are several of these in the code, and to be completely honest I'm not sure how much it would help ergonomics. Metrics are unfortunately one of those things that are inherently invasive.

@lexnv lexnv (Contributor) left a comment

LGTM! Thanks for contributing!

@nazar-pc (Contributor, Author)

I resolved a minor merge conflict, but also had to restore the useless tick. Without this tick, tests like sync::multiple_requests_are_accepted_as_long_as_they_are_not_fulfilled never finish. I haven't yet figured out what is not being polled correctly without the tick, but something in the syncing engine just hangs and never makes it past this:

if net.peer(1).client().justifications(hashof10).unwrap() !=
    Some(Justifications::from((*b"FRNK", Vec::new())))
{
    return Poll::Pending
}

If someone with more knowledge in this area can take a look at it that'd be great, but it shouldn't block this PR anymore.

@nazar-pc (Contributor, Author)

CI seems to be good now

@dmitry-markin dmitry-markin added this pull request to the merge queue Aug 23, 2024
Merged via the queue into paritytech:master with commit 4057ccd Aug 23, 2024
187 of 189 checks passed
@nazar-pc nazar-pc deleted the reactive-sync-metrics branch August 23, 2024 13:40
ordian added a commit that referenced this pull request Aug 27, 2024
* master: (36 commits)
  Bump the ci_dependencies group across 1 directory with 2 updates (#5401)
  Remove deprecated calls in cumulus-parachain-system (#5439)
  Make the PR template a default for new PRs (#5462)
  Only log the propagating transactions when they are not empty (#5424)
  [CI] Fix SemVer check base commit (#5361)
  Sync status refactoring (#5450)
  Add build options to the srtool build step (#4956)
  `MaybeConsideration` extension trait for `Consideration` (#5384)
  Skip slot before creating inherent data providers during major sync (#5344)
  Add symlinks for code of conduct and contribution guidelines (#5447)
  pallet-collator-selection: correctly register weight in `new_session` (#5430)
  Derive `Clone` on `EncodableOpaqueLeaf` (#5442)
  Moving `Find FAIL-CI` check to GHA (#5377)
  Remove panic, as proof is invalid. (#5427)
  Reactive syncing metrics (#5410)
  [bridges] Prune messages from confirmation tx body, not from the on_idle (#5006)
  Change the chain to Rococo in the parachain template Zombienet config (#5279)
  Improve the appearance of crates on `crates.io` (#5243)
  Add initial version of `pallet_revive` (#5293)
  Update OpenZeppelin template documentation (#5398)
  ...