Change getHealth to compare optimistically confirmed slots #33651

steviez · 2023-10-11T06:14:08Z

Problem

The current getHealth mechanism checks a local accounts hash slot vs. those of other nodes as specified by --known-validator. This is a very coarse comparison given that the default for this value is 100 slots. More so, any nodes using a value larger than the default (ie --incremental-snapshot-interval 500) will 100% see getHealth return status behind at some point.

Summary of Changes

Change the underlying mechanism of how health is computed. Instead of using the accounts hash slots published in gossip, use the latest optimistically confirmed slot from the cluster. The cluster confirmed slot can be compared against the latest optimistically confirmed bank to determine how far behind the node is. This new comparison is much more granular than the old approach, and also has the benefit of not going haywire if a known validator has issues (no domino effect).

Testing

With this branch, I observed a node reporting a diminishing "slots behind" value as it caught up from a process restart. To be sure, I also added a hack to stall ReplayStage for 30 seconds every 1000 slots. Stalling for 30 seconds should result in my node falling behind about 60-75 slots; this hack plus passing --health-check-slot-distance 25 allowed me to see the node fall behind (as reported by solana-validator monitor) immediately after the 30s sleep, but then quickly catch up and continue to report a healthy status.

Fixes #16957

Doing so allows for a health_check_slot_distance of 0. While this may be stricter than many operator would want to actually use on nodes, TestValidator uses a value of 0 so we need this for unit tests.

test-validator/src/lib.rs

mvines · 2023-10-12T23:31:38Z

Wait, this actually works? That is, a nodes knows the highest optimistically confirmed slot even if the highest bank it's currently replaying is older than that slot? 🤔

steviez · 2023-10-12T23:32:41Z

Wait, this actually works? That is, a nodes knows the highest optimistically confirmed slot even if the highest bank it's currently replaying is older than that slot? 🤔

Yessir!

mvines · 2023-10-12T23:40:56Z

Yessir!

via monitoring gossip votes?

rpc/src/rpc_health.rs

docs/src/api/http.md

test-validator/src/lib.rs

steviez · 2023-10-13T03:51:46Z

Yessir!

via monitoring gossip votes?

Yep, here is a graph of the optimistic_slot datapoint really zoomed in (1 min) while the node was catching up:

This metric is reported whenever a new optimistic slot is recorded in the blockstore. Note how it is oscillating back and forth between two "bounds". The upper bound line corresponds to optimistic slots from gossip; the lower bound line corresponds to optimistic slots from replaying blocks.

solana/core/src/optimistic_confirmation_verifier.rs

Line 87 in 47540af

datapoint_info!("optimistic_slot", ("slot", new_optimistic_slot, i64),);

This reverts commit df6689a.

codecov · 2023-10-13T20:06:24Z

Codecov Report

Merging #33651 (fc3649e) into master (452fd5d) will increase coverage by 0.0%.
Report is 10 commits behind head on master.
The diff coverage is 97.1%.

@@           Coverage Diff           @@
##           master   #33651   +/-   ##
=======================================
  Coverage    81.8%    81.8%           
=======================================
  Files         806      806           
  Lines      217595   217574   -21     
=======================================
+ Hits       178065   178093   +28     
+ Misses      39530    39481   -49

jbiseda

Thanks for improving this!

t-nelson · 2023-10-13T23:08:30Z

can we run a node with this change, minus the removal of the old version, that polls itself ever 5-10s and submits both the health distances via metrics?

steviez · 2023-10-14T05:25:56Z

can we run a node with this change, minus the removal of the old version, that polls itself ever 5-10s and submits both the health distances via metrics?

Sure. I tracked old and new where old is the existing mechanism and new is the mechanism added by this PR:

my_slot: what slot the node reporting metrics is at
cluster_slot: what slot the node's comparison is at
distance: cluster_slot - my_slot
healthy: whether the mechanism reported the node as healthy (1 is healthy, 0 is unknown/behind)

Lastly, I left incremental snapshot interval at 100 slots and dropped the health check slot distance to 50. I left solana-validator monitor running in a tmux session, which issues a getHealth request every 2 seconds. The node was restarted around 02:00 UTC (10/14) and finished init at 02:13:

[2023-10-14T02:13:57.583375572Z INFO  solana_metrics::metrics] datapoint: validator-new id="82DZWerqJrHMVYgvGH9RViWsUA9wVo9x66hdAwprqvC9" version="1.18.0 (src:00000000; feat:3079302975, client:SolanaLabs)" cluster_type=1i

The mechanism from this PR reported the node as healthy starting just before 02:26 whereas the old mechanism stabilized around 02:27:30

First off, here is a blow up of the legend to see which color is which trace:

And with that, here are the traces:

Some observations:

In general, the old and new mechanisms track fairly well in terms of general slope of the traces
The new mechanism shows a smooth ramp up whereas the old mechanism shows a "stair-step"; this is expected given that the old mechanism is dependent on account hashes which will be pushed every 100 slots by default
The new mechanism is slightly "ahead" in terms of absolute slot; this is also expected given that the time to take an accounts hash and publish it introduce some latency. Given that we're comparing apples-to-apples, the latency is a non-issue (whereas if we compared the cluster optimistic slot to my node's local latest account hash, that would obviously be problematic)

Lastly, here are the distance plots; technically the same information as above just with a different way of viewing:

So, things are somewhat similar on capable machine. One instance where the new mechanism is night and day better is with an incremental snapshot interval. For example, here is the new vs. old healthy if I use an incremental snapshot (and thus accounts hash) interval of 250 slots:

The square wave is expected for the old mechanism. With the old mechanism, a node will view itself as healthy immediately after publishing a new accounts hash. But, once the comparison nodes publish another hash, the node will view itself as behind for the period until it publishes another one.

Technically, me using a smaller-than-default (50) health check distance lengthens the duration of "unhealthyness". However, the old health mechanism would still exhibit the toggling between healthy/unhealthy in this case. But, the new mechanism would allow tightening this limit significantly without being dependent on known validators.

t-nelson · 2023-10-16T16:13:06Z

awesome! let's get this in!

The current getHealth mechanism checks a local accounts hash slot vs. those of other nodes as specified by --known-validator. This is a very coarse comparison given that the default for this value is 100 slots. More so, any nodes using a value larger than the default (ie --incremental-snapshot-interval 500) will likely see getHealth return status behind at some point. Change the underlying mechanism of how health is computed. Instead of using the accounts hash slots published in gossip, use the latest optimistically confirmed slot from the cluster. Even when a node is behind, it is able to observe cluster optimistically confirmed by slots by viewing votes published in gossip. Thus, the latest cluster optimistically confirmed slot can be compared against the latest optimistically confirmed bank from replay to determine health. This new comparison is much more granular, and not needing to depend on individual known validators is also a plus. (cherry picked from commit 8bd0e4c)

…bs#33651) The current getHealth mechanism checks a local accounts hash slot vs. those of other nodes as specified by --known-validator. This is a very coarse comparison given that the default for this value is 100 slots. More so, any nodes using a value larger than the default (ie --incremental-snapshot-interval 500) will likely see getHealth return status behind at some point. Change the underlying mechanism of how health is computed. Instead of using the accounts hash slots published in gossip, use the latest optimistically confirmed slot from the cluster. Even when a node is behind, it is able to observe cluster optimistically confirmed by slots by viewing votes published in gossip. Thus, the latest cluster optimistically confirmed slot can be compared against the latest optimistically confirmed bank from replay to determine health. This new comparison is much more granular, and not needing to depend on individual known validators is also a plus.

…ckport of #33651) (#33714) Change getHealth to compare optimistically confirmed slots (#33651) The current getHealth mechanism checks a local accounts hash slot vs. those of other nodes as specified by --known-validator. This is a very coarse comparison given that the default for this value is 100 slots. More so, any nodes using a value larger than the default (ie --incremental-snapshot-interval 500) will likely see getHealth return status behind at some point. Change the underlying mechanism of how health is computed. Instead of using the accounts hash slots published in gossip, use the latest optimistically confirmed slot from the cluster. Even when a node is behind, it is able to observe cluster optimistically confirmed by slots by viewing votes published in gossip. Thus, the latest cluster optimistically confirmed slot can be compared against the latest optimistically confirmed bank from replay to determine health. This new comparison is much more granular, and not needing to depend on individual known validators is also a plus. (cherry picked from commit 8bd0e4c) Co-authored-by: steviez <steven@solana.com>

The current getHealth mechanism checks a local accounts hash slot vs. those of other nodes as specified by --known-validator. This is a very coarse comparison given that the default for this value is 100 slots. More so, any nodes using a value larger than the default (ie --incremental-snapshot-interval 500) will likely see getHealth return status behind at some point. Change the underlying mechanism of how health is computed. Instead of using the accounts hash slots published in gossip, use the latest optimistically confirmed slot from the cluster. Even when a node is behind, it is able to observe cluster optimistically confirmed by slots by viewing votes published in gossip. Thus, the latest cluster optimistically confirmed slot can be compared against the latest optimistically confirmed bank from replay to determine health. This new comparison is much more granular, and not needing to depend on individual known validators is also a plus. (cherry picked from commit 8bd0e4c) # Conflicts: # rpc/src/rpc.rs

…ckport of #33651) (#33716) * Change getHealth to compare optimistically confirmed slots (#33651) The current getHealth mechanism checks a local accounts hash slot vs. those of other nodes as specified by --known-validator. This is a very coarse comparison given that the default for this value is 100 slots. More so, any nodes using a value larger than the default (ie --incremental-snapshot-interval 500) will likely see getHealth return status behind at some point. Change the underlying mechanism of how health is computed. Instead of using the accounts hash slots published in gossip, use the latest optimistically confirmed slot from the cluster. Even when a node is behind, it is able to observe cluster optimistically confirmed by slots by viewing votes published in gossip. Thus, the latest cluster optimistically confirmed slot can be compared against the latest optimistically confirmed bank from replay to determine health. This new comparison is much more granular, and not needing to depend on individual known validators is also a plus. (cherry picked from commit 8bd0e4c) # Conflicts: # rpc/src/rpc.rs * Resolve merge conflicts --------- Co-authored-by: steviez <steven@solana.com>

…bs#33651) The current getHealth mechanism checks a local accounts hash slot vs. those of other nodes as specified by --known-validator. This is a very coarse comparison given that the default for this value is 100 slots. More so, any nodes using a value larger than the default (ie --incremental-snapshot-interval 500) will likely see getHealth return status behind at some point. Change the underlying mechanism of how health is computed. Instead of using the accounts hash slots published in gossip, use the latest optimistically confirmed slot from the cluster. Even when a node is behind, it is able to observe cluster optimistically confirmed by slots by viewing votes published in gossip. Thus, the latest cluster optimistically confirmed slot can be compared against the latest optimistically confirmed bank from replay to determine health. This new comparison is much more granular, and not needing to depend on individual known validators is also a plus.

* feat: moved common docs to repo * refactor: removed sidebar items * refactor: removed unused images * fix: terminology link * fix: introduction links * fix: developing links * refactor: fixed assorted links * fix: added back the home index * refactor: home page links * refactor: primary links * fix: links * fix: updated existing redirects * feat: added new redirects * refactor: moved cli index file to cli folder * feat: turned breadcrumbs on * feat: auto generated cli sidebar * refactor: page titles * feat: added usage and wallets categories * refactor: moved wallet-guide/cli * style: page titles * refactor: renamed file to install * style: page title * refactor: relocated file to cli/usage/index.md * style: page title * refactor: relocat detailed usage generator for cli commands * refactor: relocated clie usage files * refactor: relocated paper wallet file * refactor: relocated file system wallet doc * feat: added hardware wallet category * refactor: relocated hardware wallet overview * refactor: relocated ledger wallet doc * style: clie wallet titles * refactor(revert): relocated cli usage doc * refactor: relocated to examples * style: cli examples category title * style: usage doc title * refactor: relocated cli intro doc * style: category title * refactor: renamed file * refactor: renamed file * fix: cli links * refactor: relocated file * refactor: relocated files * fix: more cli links * refactor: sidebar order * fix: final cli links? * refactor: proposals * refactor: split sidebars * refactor: removed unused icons * refactor: relocated file * refactor: relocated file * refactor: relocated file * refactor: relocated file * feat: added architecture page * refactor: reloacted filed * refactor: adjusted header links * style: sidebar labels * feat: clusters sidebar details * style: sidebar label * refactor: relocate file * refactor: relocated files * refactor: relocated files * refactor: relocated files * style: validator sidebar * style: sidebar styles * refactor: internal links * style: sidebar order * fix: internal links * feat: master sidebar * refactor: removed unneeded h2 * fix: link redirects * refactor: relocated pages * style: runtime links * refactor: simplified runtime redirects * fix: internal redirect * refactor: moved proposals to dropdown * docs: Removes accounts-on-ramdisk section (#33655) * RPC: update websocket docs (#33460) * [rpc]: update websocket docs * rename rewards to showRewards * add remaining optional fields for slotsUpdates * update block subscription showRewards * Change getHealth to compare optimistically confirmed slots (#33651) The current getHealth mechanism checks a local accounts hash slot vs. those of other nodes as specified by --known-validator. This is a very coarse comparison given that the default for this value is 100 slots. More so, any nodes using a value larger than the default (ie --incremental-snapshot-interval 500) will likely see getHealth return status behind at some point. Change the underlying mechanism of how health is computed. Instead of using the accounts hash slots published in gossip, use the latest optimistically confirmed slot from the cluster. Even when a node is behind, it is able to observe cluster optimistically confirmed by slots by viewing votes published in gossip. Thus, the latest cluster optimistically confirmed slot can be compared against the latest optimistically confirmed bank from replay to determine health. This new comparison is much more granular, and not needing to depend on individual known validators is also a plus. * build(deps): bump @babel/traverse from 7.19.6 to 7.23.2 in /docs (#33726) Bumps [@babel/traverse](https://github.com/babel/babel/tree/HEAD/packages/babel-traverse) from 7.19.6 to 7.23.2. - [Release notes](https://github.com/babel/babel/releases) - [Changelog](https://github.com/babel/babel/blob/main/CHANGELOG.md) - [Commits](https://github.com/babel/babel/commits/v7.23.2/packages/babel-traverse) --- updated-dependencies: - dependency-name: "@babel/traverse" dependency-type: indirect ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * docs: move rpc info to rpc docs (#33723) docs: link fixes docs: link fixes docs: link fixes * Fix typos in documentation for Secp256k1 native program (#33796) * docs: outline requirement of stake in order to vote (#33842) * docs: outline requirement of stake in order to vote * pr feedback: move stake section up * chore: fix some typos (#33833) * fix spelling of "retrieved" * fix spelling of "should" * fix spelling of "comparisons" * docs: updating apt install to apt upgrade (#33920) * Fix some typo in the documentation (#34058) Co-authored-by: Andrew Fitzgerald <apfitzge@gmail.com> * fix: internal links * refactor: removed rpc api docs * refactor: removed rpc sidebar * fix: updated remaining rpc api links * refactor: removed final rpc /api route * refactor: removed dangling component files * refactor: changed copyright * fix: dangling ordered list * refactor: wording around solana docs * feat: home page content * refactor: updated docs url * Link to latest version of the off-chain message signing proposal in the docs (#34329) * docs: (cli) minor updates to deploy-a-program.md (#34307) * docs: (cli) minor updates to deploy-a-program.md * address review comments * remove unnecessary impl details from the docs about deploy command upgrade flow * clarify program redeploy section --------- Co-authored-by: norwnd <norwnd> * refactor: removed GA --------- Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: Brooks <brooks@solana.com> Co-authored-by: Joe C <joe.caulfield@solana.com> Co-authored-by: steviez <steven@solana.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Jacob Creech <82475023+jacobcreech@users.noreply.github.com> Co-authored-by: Nick Guo <1387955+nickguo@users.noreply.github.com> Co-authored-by: Ashwin Sekar <ashwin@solana.com> Co-authored-by: Kevin Heavey <24635973+kevinheavey@users.noreply.github.com> Co-authored-by: Max Kaplan <max@maxkaplan.me> Co-authored-by: hugo-syn <61210734+hugo-syn@users.noreply.github.com> Co-authored-by: Andrew Fitzgerald <apfitzge@gmail.com> Co-authored-by: norwnd <112318969+norwnd@users.noreply.github.com>

steviez added the noCI Suppress CI on this Pull Request label Oct 11, 2023

steviez force-pushed the fix_get_health branch 2 times, most recently from 784126f to 570f4a5 Compare October 12, 2023 04:21

steviez changed the title ~~[WIP] Change getHealth to compare optimistic confirmed slot to local root~~ Change getHealth to compare optimistically confirmed slots Oct 12, 2023

steviez removed the noCI Suppress CI on this Pull Request label Oct 12, 2023

steviez marked this pull request as ready for review October 12, 2023 05:02

steviez force-pushed the fix_get_health branch from 90597ed to 108cb32 Compare October 12, 2023 06:55

Steven Czabaniuk added 7 commits October 12, 2023 16:13

Change getHealth to compare optimistically confirmed slots

8d86c67

Remove unit tests that are no longer relevant

597364e

Fix unit tests

931f179

Update docs to remove mention of known validators

c31694c

Update solana-validator CLI

1beba1d

Switch distance check to be >= instead of just >

8134754

Doing so allows for a health_check_slot_distance of 0. While this may be stricter than many operator would want to actually use on nodes, TestValidator uses a value of 0 so we need this for unit tests.

Fix race condition that caused CI to fail

df6689a

steviez force-pushed the fix_get_health branch from 99fde92 to df6689a Compare October 12, 2023 21:13

steviez requested a review from mvines October 12, 2023 22:03

steviez commented Oct 12, 2023

View reviewed changes

test-validator/src/lib.rs Outdated Show resolved Hide resolved

steviez requested a review from jbiseda October 12, 2023 22:05

steviez commented Oct 12, 2023

View reviewed changes

test-validator/src/lib.rs Outdated Show resolved Hide resolved

Add unit test

82fa51c

jbiseda reviewed Oct 13, 2023

View reviewed changes

rpc/src/rpc_health.rs Show resolved Hide resolved

docs/src/api/http.md Outdated Show resolved Hide resolved

test-validator/src/lib.rs Outdated Show resolved Hide resolved

jbiseda mentioned this pull request Oct 13, 2023

Warn if vaildator appears to be repairing excessively #33320

Closed

Steven Czabaniuk added 3 commits October 13, 2023 14:02

Add a flag to config to disable health check off the bat

6b63925

Revert "Fix race condition that caused CI to fail"

d854515

This reverts commit df6689a.

review comments

fc3649e

jbiseda approved these changes Oct 13, 2023

View reviewed changes

steviez merged commit 8bd0e4c into solana-labs:master Oct 16, 2023
16 checks passed

steviez deleted the fix_get_health branch October 16, 2023 16:21

steviez added the v1.17 PRs that should be backported to v1.17 label Oct 16, 2023

mergify bot mentioned this pull request Oct 16, 2023

v1.17: Change getHealth to compare optimistically confirmed slots (backport of #33651) #33714

Merged

steviez added the v1.16 PRs that should be backported to v1.16 label Oct 16, 2023

mergify bot mentioned this pull request Oct 16, 2023

v1.16: Change getHealth to compare optimistically confirmed slots (backport of #33651) #33716

Merged

HaoranYi mentioned this pull request Apr 8, 2024

pr634 1.17 new filter anza-xyz/agave#657

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Change getHealth to compare optimistically confirmed slots #33651

Change getHealth to compare optimistically confirmed slots #33651

steviez commented Oct 11, 2023 •

edited

Loading

mvines commented Oct 12, 2023

steviez commented Oct 12, 2023

mvines commented Oct 12, 2023

steviez commented Oct 13, 2023

codecov bot commented Oct 13, 2023

jbiseda left a comment

t-nelson commented Oct 13, 2023 •

edited

Loading

steviez commented Oct 14, 2023

t-nelson commented Oct 16, 2023

Change getHealth to compare optimistically confirmed slots #33651

Change getHealth to compare optimistically confirmed slots #33651

Conversation

steviez commented Oct 11, 2023 • edited Loading

Problem

Summary of Changes

Testing

mvines commented Oct 12, 2023

steviez commented Oct 12, 2023

mvines commented Oct 12, 2023

steviez commented Oct 13, 2023

codecov bot commented Oct 13, 2023

Codecov Report

jbiseda left a comment

Choose a reason for hiding this comment

t-nelson commented Oct 13, 2023 • edited Loading

steviez commented Oct 14, 2023

t-nelson commented Oct 16, 2023

steviez commented Oct 11, 2023 •

edited

Loading

t-nelson commented Oct 13, 2023 •

edited

Loading