Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change getHealth to compare optimistically confirmed slots #33651

Merged
merged 11 commits into from
Oct 16, 2023

Conversation

steviez
Copy link
Contributor

@steviez steviez commented Oct 11, 2023

Problem

The current getHealth mechanism checks a local accounts hash slot vs. those of other nodes as specified by --known-validator. This is a very coarse comparison given that the default for this value is 100 slots. More so, any nodes using a value larger than the default (ie --incremental-snapshot-interval 500) will 100% see getHealth return status behind at some point.

Summary of Changes

Change the underlying mechanism of how health is computed. Instead of using the accounts hash slots published in gossip, use the latest optimistically confirmed slot from the cluster. The cluster confirmed slot can be compared against the latest optimistically confirmed bank to determine how far behind the node is. This new comparison is much more granular than the old approach, and also has the benefit of not going haywire if a known validator has issues (no domino effect).

Testing

With this branch, I observed a node reporting a diminishing "slots behind" value as it caught up from a process restart. To be sure, I also added a hack to stall ReplayStage for 30 seconds every 1000 slots. Stalling for 30 seconds should result in my node falling behind about 60-75 slots; this hack plus passing --health-check-slot-distance 25 allowed me to see the node fall behind (as reported by solana-validator monitor) immediately after the 30s sleep, but then quickly catch up and continue to report a healthy status.

Fixes #16957

@steviez steviez added the noCI Suppress CI on this Pull Request label Oct 11, 2023
@steviez steviez force-pushed the fix_get_health branch 2 times, most recently from 784126f to 570f4a5 Compare October 12, 2023 04:21
@steviez steviez changed the title [WIP] Change getHealth to compare optimistic confirmed slot to local root Change getHealth to compare optimistically confirmed slots Oct 12, 2023
@steviez steviez removed the noCI Suppress CI on this Pull Request label Oct 12, 2023
@steviez steviez marked this pull request as ready for review October 12, 2023 05:02
Steven Czabaniuk added 7 commits October 12, 2023 16:13
Doing so allows for a health_check_slot_distance of 0. While this may be
stricter than many operator would want to actually use on nodes,
TestValidator uses a value of 0 so we need this for unit tests.
@steviez steviez requested a review from mvines October 12, 2023 22:03
test-validator/src/lib.rs Outdated Show resolved Hide resolved
@steviez steviez requested a review from jbiseda October 12, 2023 22:05
test-validator/src/lib.rs Outdated Show resolved Hide resolved
@mvines
Copy link
Member

mvines commented Oct 12, 2023

Wait, this actually works? That is, a nodes knows the highest optimistically confirmed slot even if the highest bank it's currently replaying is older than that slot? 🤔

@steviez
Copy link
Contributor Author

steviez commented Oct 12, 2023

Wait, this actually works? That is, a nodes knows the highest optimistically confirmed slot even if the highest bank it's currently replaying is older than that slot? 🤔

Yessir!

@mvines
Copy link
Member

mvines commented Oct 12, 2023

Yessir!

via monitoring gossip votes?

rpc/src/rpc_health.rs Show resolved Hide resolved
docs/src/api/http.md Outdated Show resolved Hide resolved
test-validator/src/lib.rs Outdated Show resolved Hide resolved
@steviez
Copy link
Contributor Author

steviez commented Oct 13, 2023

Yessir!

via monitoring gossip votes?

Yep, here is a graph of the optimistic_slot datapoint really zoomed in (1 min) while the node was catching up:
image

This metric is reported whenever a new optimistic slot is recorded in the blockstore. Note how it is oscillating back and forth between two "bounds". The upper bound line corresponds to optimistic slots from gossip; the lower bound line corresponds to optimistic slots from replaying blocks.

datapoint_info!("optimistic_slot", ("slot", new_optimistic_slot, i64),);

@codecov
Copy link

codecov bot commented Oct 13, 2023

Codecov Report

Merging #33651 (fc3649e) into master (452fd5d) will increase coverage by 0.0%.
Report is 10 commits behind head on master.
The diff coverage is 97.1%.

@@           Coverage Diff           @@
##           master   #33651   +/-   ##
=======================================
  Coverage    81.8%    81.8%           
=======================================
  Files         806      806           
  Lines      217595   217574   -21     
=======================================
+ Hits       178065   178093   +28     
+ Misses      39530    39481   -49     

Copy link
Contributor

@jbiseda jbiseda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for improving this!

@t-nelson
Copy link
Contributor

t-nelson commented Oct 13, 2023

can we run a node with this change, minus the removal of the old version, that polls itself ever 5-10s and submits both the health distances via metrics?

@steviez
Copy link
Contributor Author

steviez commented Oct 14, 2023

can we run a node with this change, minus the removal of the old version, that polls itself ever 5-10s and submits both the health distances via metrics?

Sure. I tracked old and new where old is the existing mechanism and new is the mechanism added by this PR:

  • my_slot: what slot the node reporting metrics is at
  • cluster_slot: what slot the node's comparison is at
  • distance: cluster_slot - my_slot
  • healthy: whether the mechanism reported the node as healthy (1 is healthy, 0 is unknown/behind)

Lastly, I left incremental snapshot interval at 100 slots and dropped the health check slot distance to 50. I left solana-validator monitor running in a tmux session, which issues a getHealth request every 2 seconds. The node was restarted around 02:00 UTC (10/14) and finished init at 02:13:

[2023-10-14T02:13:57.583375572Z INFO  solana_metrics::metrics] datapoint: validator-new id="82DZWerqJrHMVYgvGH9RViWsUA9wVo9x66hdAwprqvC9" version="1.18.0 (src:00000000; feat:3079302975, client:SolanaLabs)" cluster_type=1i

The mechanism from this PR reported the node as healthy starting just before 02:26 whereas the old mechanism stabilized around 02:27:30
image

First off, here is a blow up of the legend to see which color is which trace:
image
And with that, here are the traces:
image
Some observations:

  • In general, the old and new mechanisms track fairly well in terms of general slope of the traces
  • The new mechanism shows a smooth ramp up whereas the old mechanism shows a "stair-step"; this is expected given that the old mechanism is dependent on account hashes which will be pushed every 100 slots by default
  • The new mechanism is slightly "ahead" in terms of absolute slot; this is also expected given that the time to take an accounts hash and publish it introduce some latency. Given that we're comparing apples-to-apples, the latency is a non-issue (whereas if we compared the cluster optimistic slot to my node's local latest account hash, that would obviously be problematic)

Lastly, here are the distance plots; technically the same information as above just with a different way of viewing:
image

So, things are somewhat similar on capable machine. One instance where the new mechanism is night and day better is with an incremental snapshot interval. For example, here is the new vs. old healthy if I use an incremental snapshot (and thus accounts hash) interval of 250 slots:
image
The square wave is expected for the old mechanism. With the old mechanism, a node will view itself as healthy immediately after publishing a new accounts hash. But, once the comparison nodes publish another hash, the node will view itself as behind for the period until it publishes another one.

Technically, me using a smaller-than-default (50) health check distance lengthens the duration of "unhealthyness". However, the old health mechanism would still exhibit the toggling between healthy/unhealthy in this case. But, the new mechanism would allow tightening this limit significantly without being dependent on known validators.

@t-nelson
Copy link
Contributor

awesome! let's get this in!

@steviez steviez merged commit 8bd0e4c into solana-labs:master Oct 16, 2023
16 checks passed
@steviez steviez deleted the fix_get_health branch October 16, 2023 16:21
@steviez steviez added the v1.17 PRs that should be backported to v1.17 label Oct 16, 2023
mergify bot pushed a commit that referenced this pull request Oct 16, 2023
The current getHealth mechanism checks a local accounts hash slot vs.
those of other nodes as specified by --known-validator. This is a
very coarse comparison given that the default for this value is 100
slots. More so, any nodes using a value larger than the default
(ie --incremental-snapshot-interval 500) will likely see getHealth
return status behind at some point.

Change the underlying mechanism of how health is computed. Instead of
using the accounts hash slots published in gossip, use the latest
optimistically confirmed slot from the cluster. Even when a node is
behind, it is able to observe cluster optimistically confirmed by slots
by viewing votes published in gossip.

Thus, the latest cluster optimistically confirmed slot can be compared
against the latest optimistically confirmed bank from replay to
determine health. This new comparison is much more granular, and not
needing to depend on individual known validators is also a plus.

(cherry picked from commit 8bd0e4c)
jeffwashington pushed a commit to jeffwashington/solana that referenced this pull request Oct 16, 2023
…bs#33651)

The current getHealth mechanism checks a local accounts hash slot vs.
those of other nodes as specified by --known-validator. This is a
very coarse comparison given that the default for this value is 100
slots. More so, any nodes using a value larger than the default
(ie --incremental-snapshot-interval 500) will likely see getHealth
return status behind at some point.

Change the underlying mechanism of how health is computed. Instead of
using the accounts hash slots published in gossip, use the latest
optimistically confirmed slot from the cluster. Even when a node is
behind, it is able to observe cluster optimistically confirmed by slots
by viewing votes published in gossip.

Thus, the latest cluster optimistically confirmed slot can be compared
against the latest optimistically confirmed bank from replay to
determine health. This new comparison is much more granular, and not
needing to depend on individual known validators is also a plus.
steviez pushed a commit that referenced this pull request Oct 16, 2023
…ckport of #33651) (#33714)

Change getHealth to compare optimistically confirmed slots (#33651)

The current getHealth mechanism checks a local accounts hash slot vs.
those of other nodes as specified by --known-validator. This is a
very coarse comparison given that the default for this value is 100
slots. More so, any nodes using a value larger than the default
(ie --incremental-snapshot-interval 500) will likely see getHealth
return status behind at some point.

Change the underlying mechanism of how health is computed. Instead of
using the accounts hash slots published in gossip, use the latest
optimistically confirmed slot from the cluster. Even when a node is
behind, it is able to observe cluster optimistically confirmed by slots
by viewing votes published in gossip.

Thus, the latest cluster optimistically confirmed slot can be compared
against the latest optimistically confirmed bank from replay to
determine health. This new comparison is much more granular, and not
needing to depend on individual known validators is also a plus.

(cherry picked from commit 8bd0e4c)

Co-authored-by: steviez <steven@solana.com>
@steviez steviez added the v1.16 PRs that should be backported to v1.16 label Oct 16, 2023
mergify bot pushed a commit that referenced this pull request Oct 16, 2023
The current getHealth mechanism checks a local accounts hash slot vs.
those of other nodes as specified by --known-validator. This is a
very coarse comparison given that the default for this value is 100
slots. More so, any nodes using a value larger than the default
(ie --incremental-snapshot-interval 500) will likely see getHealth
return status behind at some point.

Change the underlying mechanism of how health is computed. Instead of
using the accounts hash slots published in gossip, use the latest
optimistically confirmed slot from the cluster. Even when a node is
behind, it is able to observe cluster optimistically confirmed by slots
by viewing votes published in gossip.

Thus, the latest cluster optimistically confirmed slot can be compared
against the latest optimistically confirmed bank from replay to
determine health. This new comparison is much more granular, and not
needing to depend on individual known validators is also a plus.

(cherry picked from commit 8bd0e4c)

# Conflicts:
#	rpc/src/rpc.rs
steviez pushed a commit that referenced this pull request Oct 20, 2023
…ckport of #33651) (#33716)

* Change getHealth to compare optimistically confirmed slots (#33651)

The current getHealth mechanism checks a local accounts hash slot vs.
those of other nodes as specified by --known-validator. This is a
very coarse comparison given that the default for this value is 100
slots. More so, any nodes using a value larger than the default
(ie --incremental-snapshot-interval 500) will likely see getHealth
return status behind at some point.

Change the underlying mechanism of how health is computed. Instead of
using the accounts hash slots published in gossip, use the latest
optimistically confirmed slot from the cluster. Even when a node is
behind, it is able to observe cluster optimistically confirmed by slots
by viewing votes published in gossip.

Thus, the latest cluster optimistically confirmed slot can be compared
against the latest optimistically confirmed bank from replay to
determine health. This new comparison is much more granular, and not
needing to depend on individual known validators is also a plus.

(cherry picked from commit 8bd0e4c)

# Conflicts:
#	rpc/src/rpc.rs

* Resolve merge conflicts

---------

Co-authored-by: steviez <steven@solana.com>
nickfrosty pushed a commit to nickfrosty/solana that referenced this pull request Nov 15, 2023
…bs#33651)

The current getHealth mechanism checks a local accounts hash slot vs.
those of other nodes as specified by --known-validator. This is a
very coarse comparison given that the default for this value is 100
slots. More so, any nodes using a value larger than the default
(ie --incremental-snapshot-interval 500) will likely see getHealth
return status behind at some point.

Change the underlying mechanism of how health is computed. Instead of
using the accounts hash slots published in gossip, use the latest
optimistically confirmed slot from the cluster. Even when a node is
behind, it is able to observe cluster optimistically confirmed by slots
by viewing votes published in gossip.

Thus, the latest cluster optimistically confirmed slot can be compared
against the latest optimistically confirmed bank from replay to
determine health. This new comparison is much more granular, and not
needing to depend on individual known validators is also a plus.
willhickey pushed a commit that referenced this pull request Dec 11, 2023
* feat: moved common docs to  repo

* refactor: removed sidebar items

* refactor: removed unused images

* fix: terminology link

* fix: introduction links

* fix: developing links

* refactor: fixed assorted links

* fix: added back the home index

* refactor: home page links

* refactor: primary links

* fix: links

* fix: updated existing redirects

* feat: added new redirects

* refactor: moved cli index file to cli folder

* feat: turned breadcrumbs on

* feat: auto generated cli sidebar

* refactor: page titles

* feat: added usage and wallets categories

* refactor: moved wallet-guide/cli

* style: page titles

* refactor: renamed file to install

* style: page title

* refactor: relocated file to cli/usage/index.md

* style: page title

* refactor: relocat detailed usage generator for cli commands

* refactor: relocated clie usage files

* refactor: relocated paper wallet file

* refactor: relocated file system wallet doc

* feat: added hardware wallet category

* refactor: relocated hardware wallet overview

* refactor: relocated ledger wallet doc

* style: clie wallet titles

* refactor(revert): relocated cli usage doc

* refactor: relocated to examples

* style: cli examples category title

* style: usage doc title

* refactor: relocated cli intro doc

* style: category title

* refactor: renamed file

* refactor: renamed file

* fix: cli links

* refactor: relocated file

* refactor: relocated files

* fix: more cli links

* refactor: sidebar order

* fix: final cli links?

* refactor: proposals

* refactor: split sidebars

* refactor: removed unused icons

* refactor: relocated file

* refactor: relocated file

* refactor: relocated file

* refactor: relocated file

* feat: added architecture page

* refactor: reloacted filed

* refactor: adjusted header links

* style: sidebar labels

* feat: clusters sidebar details

* style: sidebar label

* refactor: relocate file

* refactor: relocated files

* refactor: relocated files

* refactor: relocated files

* style: validator sidebar

* style: sidebar styles

* refactor: internal links

* style: sidebar order

* fix: internal links

* feat: master sidebar

* refactor: removed unneeded h2

* fix: link redirects

* refactor: relocated pages

* style: runtime links

* refactor: simplified runtime redirects

* fix: internal redirect

* refactor: moved proposals to dropdown

* docs: Removes accounts-on-ramdisk section (#33655)

* RPC: update websocket docs (#33460)

* [rpc]: update websocket docs

* rename rewards to showRewards

* add remaining optional fields for slotsUpdates

* update block subscription showRewards

* Change getHealth to compare optimistically confirmed slots (#33651)

The current getHealth mechanism checks a local accounts hash slot vs.
those of other nodes as specified by --known-validator. This is a
very coarse comparison given that the default for this value is 100
slots. More so, any nodes using a value larger than the default
(ie --incremental-snapshot-interval 500) will likely see getHealth
return status behind at some point.

Change the underlying mechanism of how health is computed. Instead of
using the accounts hash slots published in gossip, use the latest
optimistically confirmed slot from the cluster. Even when a node is
behind, it is able to observe cluster optimistically confirmed by slots
by viewing votes published in gossip.

Thus, the latest cluster optimistically confirmed slot can be compared
against the latest optimistically confirmed bank from replay to
determine health. This new comparison is much more granular, and not
needing to depend on individual known validators is also a plus.

* build(deps): bump @babel/traverse from 7.19.6 to 7.23.2 in /docs (#33726)

Bumps [@babel/traverse](https://github.com/babel/babel/tree/HEAD/packages/babel-traverse) from 7.19.6 to 7.23.2.
- [Release notes](https://github.com/babel/babel/releases)
- [Changelog](https://github.com/babel/babel/blob/main/CHANGELOG.md)
- [Commits](https://github.com/babel/babel/commits/v7.23.2/packages/babel-traverse)

---
updated-dependencies:
- dependency-name: "@babel/traverse"
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* docs: move rpc info to rpc docs (#33723)

docs: link fixes

docs: link fixes

docs: link fixes

* Fix typos in documentation for Secp256k1 native program (#33796)

* docs: outline requirement of stake in order to vote (#33842)

* docs: outline requirement of stake in order to vote

* pr feedback: move stake section up

* chore: fix some typos (#33833)

* fix spelling of "retrieved"
* fix spelling of "should"
* fix spelling of "comparisons"

* docs: updating apt install to apt upgrade (#33920)

* Fix some typo in the documentation (#34058)

Co-authored-by: Andrew Fitzgerald <apfitzge@gmail.com>

* fix: internal links

* refactor: removed rpc api docs

* refactor: removed rpc sidebar

* fix: updated remaining rpc api links

* refactor: removed final rpc /api route

* refactor: removed dangling component files

* refactor: changed copyright

* fix: dangling ordered list

* refactor: wording around solana docs

* feat: home page content

* refactor: updated docs url

* Link to latest version of the off-chain message signing proposal in the docs (#34329)

* docs: (cli) minor updates to deploy-a-program.md (#34307)

* docs: (cli) minor updates to deploy-a-program.md

* address review comments

* remove unnecessary impl details from the docs about deploy command upgrade flow

* clarify program redeploy section

---------

Co-authored-by: norwnd <norwnd>

* refactor: removed GA

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Brooks <brooks@solana.com>
Co-authored-by: Joe C <joe.caulfield@solana.com>
Co-authored-by: steviez <steven@solana.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Jacob Creech <82475023+jacobcreech@users.noreply.github.com>
Co-authored-by: Nick Guo <1387955+nickguo@users.noreply.github.com>
Co-authored-by: Ashwin Sekar <ashwin@solana.com>
Co-authored-by: Kevin Heavey <24635973+kevinheavey@users.noreply.github.com>
Co-authored-by: Max Kaplan <max@maxkaplan.me>
Co-authored-by: hugo-syn <61210734+hugo-syn@users.noreply.github.com>
Co-authored-by: Andrew Fitzgerald <apfitzge@gmail.com>
Co-authored-by: norwnd <112318969+norwnd@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
v1.16 PRs that should be backported to v1.16 v1.17 PRs that should be backported to v1.17
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Catchup & RPC getHealth Results Do Not Match
4 participants