
Update Watchtower to support multiple RPC sources #4621

Open

willhickey opened this issue Jan 24, 2025 · 14 comments

@willhickey

Problem

Watchtower monitors cluster health and node delinquency by polling RPC endpoints, but the current implementation uses only a single RPC URL. That RPC service becomes a single point of failure.

Proposed Solution

Update Watchtower config and logic to support:

  • Multiple RPC URLs
  • A way to specify how many RPC service results must cross a threshold before an alert is triggered (e.g., a common setup would be 3 RPC URLs, where problematic results from any 2 of the 3 trigger alerts; see the sketch below).
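
A minimal sketch of that m-of-n rule (the `EndpointStatus` type and `alert_threshold` name are hypothetical, not existing Watchtower code):

```rust
// Hypothetical types/names for illustration only.
enum EndpointStatus {
    Healthy,
    Problem(String), // e.g. delinquency or stalled transaction count
}

/// Alert when at least `alert_threshold` of the polled RPC endpoints
/// report a problem (e.g. any 2 of 3 in the setup described above).
fn should_alert(results: &[EndpointStatus], alert_threshold: usize) -> bool {
    let problems = results
        .iter()
        .filter(|s| matches!(s, EndpointStatus::Problem(_)))
        .count();
    problems >= alert_threshold
}
```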
@nk4rter

nk4rter commented Jan 26, 2025

I can help with this issue. Here are some questions for clarification:

  1. Do we take master branch as a base for this change?
  2. Let's say that we specified that 2/3 RPC service results need to confirm the problem in order to trigger alerts. What should we do if only some of them are reachable? Here are some example cases:
    a. 1/3 is reachable and it reports a problem (100% of reachable nodes report a problem)
    b. 2/3 are reachable but only one reports a problem (50% of reachable nodes report a problem)
    Or do we simply ignore the unreachable RPC services?
  3. json_rpc_url is a part of solana_cli_config::Config, and it's a single URL. As far as I can see, defaults from solana_cli_config::Config are used to initialize the watchtower config. So do we need to add a new argument that will accept a list of endpoints and use the default one only if the provided list is empty?

@steviez

steviez commented Jan 27, 2025

I'll let Will comment on the other items but

Do we take master branch as a base for this change?

Yes

@t-nelson

you're effectively implementing https://en.wikipedia.org/wiki/Triple_modular_redundancy. we want threshold endpoints to agree that we're healthy, not faulty, despite OP

I can help with this issue. Here are some questions for clarification:

1. Do we take `master` branch as a base for this change?

yes

2. Let's say that we specified that 2/3 RPC service results need to confirm the problem in order to trigger alerts. What should we do if only some of them are reachable? Here are some example cases:
   a. 1/3 is reachable and it reports a problem (100% of reachable nodes report a problem)
   b. 2/3 are reachable but only one reports a problem (50% of reachable nodes report a problem)
   Or do we simply ignore the unreachable RPC services?

definitely do not ignore, that would defeat the purpose of the change

  • each endpoint returning network/http errors should raise a new alert type indicating unavailable data source
  • raise an alert whenever threshold endpoints do not agree that the system is healthy
    examples:
    • one ep http 502, one indicating tx count stall, one indicating healthy -- raise tx count stall alert and 502 endpoint alert
    • threshold+ endpoints return networking/http errors -- raise watchtower reliability alert
3. `json_rpc_url` is a part of `solana_cli_config::Config`, and it's a single URL. As far as I can see, defaults from `solana_cli_config::Config` are used to initialize the watchtower config. So do we need to add a new argument that will accept a list of endpoints and use the default one only if the provided list is empty?

the cli config should be modified as follows (rough sketch after the list):

  • query the CLI Config before configuring the clap::App, use the rpc url here as the .default_value() for the "json_rpc_url" clap::Arg
  • configure "json_rpc_url" clap::Arg to accept multiple instances
    • num instances must be one, or 3+ (accepting two is error prone)
  • add threshold arg (--min-agreeing-endpoints?) with some value sanity checks
    • required if --url instances > 1
    • min value 2
    • maybe warn if 2 <= threshold <= (num_rpc_urls / 2 + 1) ?
      NOTE: we use clap v2.34.0, so be sure to reference the correct docs. i think some of the requirements here are too complex to use normal clap arg validators, unfortunately
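
A rough sketch of what this could look like against clap v2.34; the argument names, the `parse_args` helper, and the post-parse checks are illustrative, not final:

```rust
use clap::{App, Arg};
use solana_cli_config::{Config, CONFIG_FILE};

// Rough sketch of the CLI changes outlined above; not final.
fn parse_args() -> (Vec<String>, usize) {
    // Load the CLI config first so its RPC URL can serve as the default.
    let cli_config = CONFIG_FILE
        .as_ref()
        .and_then(|path| Config::load(path).ok())
        .unwrap_or_default();

    let matches = App::new("solana-watchtower")
        .arg(
            Arg::with_name("json_rpc_url")
                .long("url")
                .value_name("URL")
                .takes_value(true)
                .multiple(true)
                .default_value(&cli_config.json_rpc_url),
        )
        .arg(
            Arg::with_name("min_agreeing_endpoints")
                .long("min-agreeing-endpoints")
                .value_name("COUNT")
                .takes_value(true),
        )
        .get_matches();

    let urls: Vec<String> = matches
        .values_of("json_rpc_url")
        .unwrap()
        .map(String::from)
        .collect();

    // "one, or 3+" plus the threshold rules are too entangled for plain
    // clap validators, so enforce them after parsing.
    if urls.len() == 2 {
        eprintln!("error: provide exactly one RPC URL, or three or more");
        std::process::exit(1);
    }
    let threshold: usize = matches
        .value_of("min_agreeing_endpoints")
        .map(|v| v.parse().expect("--min-agreeing-endpoints must be a number"))
        .unwrap_or(1);
    if urls.len() > 1 && threshold < 2 {
        eprintln!("error: --min-agreeing-endpoints must be at least 2 when multiple URLs are given");
        std::process::exit(1);
    }

    (urls, threshold)
}
```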

@nk4rter

nk4rter commented Jan 27, 2025

@t-nelson Okay, this all makes total sense. Thanks for the detailed description!
Is it feasible to require json_rpc_url to have an odd number of values so that ties cannot happen?

@willhickey
Author

Is it feasible to require json_rpc_url to have an odd number of values so that ties cannot happen?

I think ties are fine... the desired behavior in the event of a tie would be dictated by the number of RPCs provided and the value of --min-agreeing-endpoints

@t-nelson

tbh there's probably little gain in accepting more than three rpc sources. anyone who wants to is probably missing the point and shouldn't be using more than one. it's not like we're sending this thing into space for centuries. this would actually simplify the cli config

  • --url remains untouched
  • introduce --url-set requiring exactly three values
  • --url and --url-set conflict
  • --min-agreeing-endpoints is now unnecessary as the only sane values are one for --url and two for --url-set

then genericization of the existing logic is identical
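
Roughly, under that proposal (clap v2.34; the `url_set` argument name and `watchtower_app` helper are illustrative):

```rust
use clap::{App, Arg};

// Sketch of the simplified interface proposed above; not final.
fn watchtower_app<'a, 'b>() -> App<'a, 'b> {
    App::new("solana-watchtower")
        .arg(
            Arg::with_name("json_rpc_url")
                .long("url")
                .value_name("URL")
                .takes_value(true),
        )
        .arg(
            Arg::with_name("url_set")
                .long("url-set")
                .value_name("URLS")
                .takes_value(true)
                .number_of_values(3) // exactly three endpoints
                .conflicts_with("json_rpc_url"),
        )
}
```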

@nk4rter

nk4rter commented Jan 28, 2025

@t-nelson I like the idea of just allowing --url to accept multiple arguments. And considering that there's little benefit from requiring too many nodes to agree, we can just always expect num_rpc_urls / 2 + 1 to confirm health. This approach doesn't require any additional arguments and seems very simple from the user standpoint. Also restricting the maximum number of endpoints wouldn't really simplify the logic of watchtower anyway...
Would allowing 2 endpoints still be bad in this case? They would both have to agree. Maybe it would be easier to just clearly state in the docs that 1 or 3 is advised but technically allow 2, because this would allow for a simpler user interface?
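
For reference, that implicit threshold is just a strict majority of the configured endpoints (illustrative helper, not existing code):

```rust
// Strict majority of the configured endpoints must agree; no extra CLI argument.
fn min_agreeing_endpoints(num_rpc_urls: usize) -> usize {
    num_rpc_urls / 2 + 1 // 1 -> 1, 2 -> 2, 3 -> 2, 5 -> 3
}
```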

@t-nelson

"never go to sea with two chronometers. take one, or three"

@nk4rter

nk4rter commented Jan 30, 2025

Shouldn't this be a >=?
https://github.com/anza-xyz/agave/blob/master/watchtower/src/main.rs#L386
Or is this intentional?
The default value of unhealthy_threshold is 1, so currently we always skip the first failure.

@t-nelson

i think the logic is correct as is. threshold zero (no appetite) should alert the first one
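
A hedged reading of the check being discussed (not a verbatim copy of main.rs), showing why `>` gives that behavior:

```rust
// Assumes the consecutive-failure counter is incremented before the check.
fn should_notify(num_consecutive_failures: usize, unhealthy_threshold: usize) -> bool {
    num_consecutive_failures > unhealthy_threshold
}

fn main() {
    assert!(should_notify(1, 0));  // threshold 0: the very first failure alerts
    assert!(!should_notify(1, 1)); // default threshold 1: the first failure is skipped
    assert!(should_notify(2, 1));  // the second consecutive failure alerts
}
```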

@nk4rter

nk4rter commented Feb 1, 2025

It looks like I'm almost done with the implementation...
A couple of questions:

  1. Let's say we have 3 endpoints: 1 is unreachable, 1 says the transaction count is not advancing, 1 says the validator is delinquent. Do we report both alerts as separate datapoint_error!s, or do we report a single datapoint_error! and concatenate the messages?
  2. Again 3 endpoints: 2 are unreachable, 1 says the validator is delinquent. Do we report the datapoint_error! for the watchtower-reliability alert and the delinquency alert, or do we only report the watchtower-reliability?
  3. If we got different failure reports from several nodes, but they are skipped by config.unhealthy_threshold, do we still need to write every message in the info!("Failure {} of {}: ...") log? It would be convenient not to do that, and all the failures are logged as error! anyway.
  4. Should we report datapoint_info!("watchtower-sanity", ("ok", true, bool)); here as well? https://github.com/anza-xyz/agave/blob/master/watchtower/src/main.rs#L398
    Because now, when we filter the failures by config.unhealthy_threshold, we stop saying that ok is true as well, even though we don't provide any datapoint_error!s yet.

@t-nelson

t-nelson commented Feb 4, 2025

It looks like I'm almost done with the implementation... A couple of questions:

1. Let's say we have 3 endpoints: 1 is unreachable, 1 says transaction count is not advancing, 1 says that validator is delinquent. Do we report both alerts as separate `datapoint_error!` or, do we report a single `datapoint_error!` and concatenate the messages or something?

i think just raise watchtower reliability alert. we can't make any sense of the information we've collected

2. Again 3 endpoints: 2 are unreachable, 1 says the validator is delinquent. Do we report the `datapoint_error!` for the `watchtower-reliability` alert _and_ the delinquency alert, or do we only report the `watchtower-reliability`?

just reliability

3. If we got different failure reports from several nodes, but they are skipped by `config.unhealthy_threshold`, do we still need to write every message in the `info!("Failure {} of {}: ...")` log? It would be convenient not to do that, and all the failures are logged as `error!` anyway.

i think having the log messages will be useful when a watchtower reliability alert is raised

4. Should we report `datapoint_info!("watchtower-sanity", ("ok", true, bool));` here as well? https://github.com/anza-xyz/agave/blob/master/watchtower/src/main.rs#L398
   Because now when we filter the failures by the `config.unhealthy_threshold`, we stop saying that ok is true as well, even though we don't provide any `datapoint_error!`s yet.

i think you're thinking about this backwards. we want threshold endpoints to agree that our node and the cluster are healthy. we raise an alert any time that's not the case. attempting to get agreement on the failure mode will be very difficult to get right
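
In other words, count healthy agreement rather than failures (a sketch of that rule; the `RpcReport` type and names are illustrative):

```rust
// Illustrative types/names; not existing Watchtower code.
enum RpcReport {
    Healthy,
    Unhealthy(String), // delinquency, stalled transaction count, etc.
    Unreachable,       // network / HTTP error
}

/// Raise an alert whenever fewer than `min_agreeing_endpoints` endpoints
/// agree that the node and cluster are healthy, regardless of why the
/// rest disagree or failed to answer.
fn needs_alert(reports: &[RpcReport], min_agreeing_endpoints: usize) -> bool {
    let healthy = reports
        .iter()
        .filter(|r| matches!(r, RpcReport::Healthy))
        .count();
    healthy < min_agreeing_endpoints
}
```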

@nk4rter

nk4rter commented Feb 4, 2025

@t-nelson
Okay, I'll only raise the unreliability alert in cases 1 & 2.
In case 3, because of cases 1 & 2, it is guaranteed that we only have 1 failure message, but the other errors that were replaced by the unreliability alert will still be present in the logs.
Regarding question 4, I meant something different.
So we essentially have 3 branches here:

  1. >=min_agreeing_endpoints endpoints say that the cluster is healthy
  2. <min_agreeing_endpoints endpoints say that the cluster is healthy, but the alert is skipped because config.unhealthy_threshold (which says how many consecutive failures must occur to trigger a notification) has not been exceeded yet.
  3. <min_agreeing_endpoints endpoints say that the cluster is healthy, and the alert is not skipped, because the failure has occurred more than config.unhealthy_threshold times.
    What seemed weird to me is that in branch 1 we add a datapoint with ok=true, and in branch 3 we add a datapoint with ok=false, but in branch 2 we don't add any datapoint at all. So from the point of view of an entity looking at these datapoints, it would appear that watchtower stopped sending anything for however many iterations branch 2 is active. It seems more logical to keep adding the ok=true datapoint until we hit branch 3.
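
A sketch of those three branches, assuming the solana-metrics macros quoted above; the `report` helper, its parameters, and the error-datapoint fields are illustrative:

```rust
use solana_metrics::{datapoint_error, datapoint_info};

// Illustrative sketch of the three branches; not the actual main.rs code.
fn report(
    healthy_endpoints: usize,
    min_agreeing_endpoints: usize,
    num_consecutive_failures: usize,
    unhealthy_threshold: usize,
    failure_message: &str,
) {
    if healthy_endpoints >= min_agreeing_endpoints {
        // Branch 1: enough endpoints agree that we're healthy.
        datapoint_info!("watchtower-sanity", ("ok", true, bool));
    } else if num_consecutive_failures <= unhealthy_threshold {
        // Branch 2: unhealthy, but still within the grace period.
        // The suggestion above: keep emitting ok=true here so the
        // datapoint stream has no gaps.
        datapoint_info!("watchtower-sanity", ("ok", true, bool));
    } else {
        // Branch 3: unhealthy past the threshold; raise the alert.
        datapoint_info!("watchtower-sanity", ("ok", false, bool));
        datapoint_error!("watchtower-sanity-failure", ("err", failure_message, String));
    }
}
```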

@nk4rter

nk4rter commented Feb 4, 2025

I updated the pull request accordingly and moved it out of the draft state, although the CI still needs to be approved :)
