
Update Watchtower to support multiple RPC sources #4621

Open

willhickey opened this issue Jan 24, 2025 · 14 comments

@willhickey

Problem

Watchtower monitors cluster health and node delinquency by polling RPC endpoints, but the current implementation uses only a single RPC URL. That RPC service becomes a single point of failure.

Proposed Solution

Update Watchtower config and logic to support:

  • Multiple RPC URLs
  • A way to specify how many RPC service results must cross a threshold before an alert is triggered (e.g., a common setup would be 3 RPC URLs, where problematic results from any 2 of the 3 trigger alerts; see the sketch below).
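
A minimal sketch of that m-of-n rule (the `EndpointStatus` type and `alert_threshold` name are hypothetical, not existing Watchtower code):

```rust
// Hypothetical types/names for illustration only.
enum EndpointStatus {
    Healthy,
    Problem(String), // e.g. delinquency or stalled transaction count
}

/// Alert when at least `alert_threshold` of the polled RPC endpoints
/// report a problem (e.g. any 2 of 3 in the setup described above).
fn should_alert(results: &[EndpointStatus], alert_threshold: usize) -> bool {
    let problems = results
        .iter()
        .filter(|s| matches!(s, EndpointStatus::Problem(_)))
        .count();
    problems >= alert_threshold
}
```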
@nk4rter

nk4rter commented Jan 26, 2025

I can help with this issue. Here are some questions for clarification:

  1. Do we take master branch as a base for this change?
  2. Let's say that we specified that 2/3 RPC service results need to confirm the problem in order to trigger alerts. What should we do if only some of them are reachable? Here are some example cases:
    a. 1/3 is reachable and it reports a problem (100% of reachable nodes report a problem)
    b. 2/3 are reachable but only one reports a problem (50% of reachable nodes report a problem)
    Or do we simply ignore the unreachable RPC services?
  3. json_rpc_url is a part of solana_cli_config::Config, and it's a single URL. As far as I can see, defaults from solana_cli_config::Config are used to initialize the watchtower config. So do we need to add a new argument that will accept a list of endpoints and use the default one only if the provided list is empty?

@steviez

steviez commented Jan 27, 2025

I'll let Will comment on the other items but

Do we take master branch as a base for this change?

Yes

@t-nelson

you're effectively implementing https://en.wikipedia.org/wiki/Triple_modular_redundancy. we want threshold endpoints to agree that we're healthy, not faulty, despite OP

I can help with this issue. Here are some questions for clarification:

1. Do we take `master` branch as a base for this change?

yes

2. Let's say that we specified that 2/3 RPC service results need to confirm the problem in order to trigger alerts. What should we do if only some of them are reachable? Here are some example cases:
   a. 1/3 is reachable and it reports a problem (100% of reachable nodes report a problem)
   b. 2/3 are reachable but only one reports a problem (50% of reachable nodes report a problem)
   Or do we simply ignore the unreachable RPC services?

definitely do not ignore, that would defeat the purpose of the change

  • each endpoint returning network/http errors should raise a new alert type indicating unavailable data source
  • raise an alert whenever threshold endpoints do not agree that the system is healthy
    examples:
    • one ep http 502, one indicating tx count stall, one indicating healthy -- raise tx count stall alert and 502 endpoint alert
    • threshold+ endpoints return networking/http errors -- raise watchtower reliability alert
3. `json_rpc_url` is a part of `solana_cli_config::Config`, and it's a single URL. As far as I can see, defaults from `solana_cli_config::Config` are used to initialize the watchtower config. So do we need to add a new argument that will accept a list of endpoints and use the default one only if the provided list is empty?

the cli config should be modified as follows (rough sketch after the list):

  • query the CLI Config before configuring the clap::App, use the rpc url here as the .default_value() for the "json_rpc_url" clap::Arg
  • configure "json_rpc_url" clap::Arg to accept multiple instances
    • num instances must be one, or 3+ (accepting two is error prone)
  • add threshold arg (--min-agreeing-endpoints?) with some value sanity checks
    • required if --url instances > 1
    • min value 2
    • maybe warn if 2 <= threshold <= (num_rpc_urls / 2 + 1) ?
      NOTE: we use clap v2.34.0, so be sure to reference the correct docs. i think some of the requirements here are too complex to use normal clap arg validators, unfortunately
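
A rough sketch of what this could look like against clap v2.34; the argument names, the `parse_args` helper, and the post-parse checks are illustrative, not final:

```rust
use clap::{App, Arg};
use solana_cli_config::{Config, CONFIG_FILE};

// Rough sketch of the CLI changes outlined above; not final.
fn parse_args() -> (Vec<String>, usize) {
    // Load the CLI config first so its RPC URL can serve as the default.
    let cli_config = CONFIG_FILE
        .as_ref()
        .and_then(|path| Config::load(path).ok())
        .unwrap_or_default();

    let matches = App::new("solana-watchtower")
        .arg(
            Arg::with_name("json_rpc_url")
                .long("url")
                .value_name("URL")
                .takes_value(true)
                .multiple(true)
                .default_value(&cli_config.json_rpc_url),
        )
        .arg(
            Arg::with_name("min_agreeing_endpoints")
                .long("min-agreeing-endpoints")
                .value_name("COUNT")
                .takes_value(true),
        )
        .get_matches();

    let urls: Vec<String> = matches
        .values_of("json_rpc_url")
        .unwrap()
        .map(String::from)
        .collect();

    // "one, or 3+" plus the threshold rules are too entangled for plain
    // clap validators, so enforce them after parsing.
    if urls.len() == 2 {
        eprintln!("error: provide exactly one RPC URL, or three or more");
        std::process::exit(1);
    }
    let threshold: usize = matches
        .value_of("min_agreeing_endpoints")
        .map(|v| v.parse().expect("--min-agreeing-endpoints must be a number"))
        .unwrap_or(1);
    if urls.len() > 1 && threshold < 2 {
        eprintln!("error: --min-agreeing-endpoints must be at least 2 when multiple URLs are given");
        std::process::exit(1);
    }

    (urls, threshold)
}
```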

@nk4rter

nk4rter commented Jan 27, 2025

@t-nelson Okay, this all makes total sense. Thanks for the detailed description!
Is it feasible to require json_rpc_url to have an odd number of values so that ties cannot happen?

@willhickey
Author

Is it feasible to require json_rpc_url to have an odd number of values so that ties cannot happen?

I think ties are fine... the desired behavior in the event of a tie would be dictated by the number of RPCs provided and the value of --min-agreeing-endpoints

@t-nelson

tbh there's probably little gain in accepting more than three rpc sources. anyone who wants to is probably missing the point and shouldn't be using more than one. it's not like we're sending this thing into space for centuries. this would actually simplify the cli config

  • --url remains untouched
  • introduce --url-set requiring exactly three values
  • --url and --url-set conflict
  • --min-agreeing-endpoints is now unnecessary as the only sane values are one for --url and two for --url-set

then genericization of the existing logic is identical
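
Roughly, under that proposal (clap v2.34; the `url_set` argument name and `watchtower_app` helper are illustrative):

```rust
use clap::{App, Arg};

// Sketch of the simplified interface proposed above; not final.
fn watchtower_app<'a, 'b>() -> App<'a, 'b> {
    App::new("solana-watchtower")
        .arg(
            Arg::with_name("json_rpc_url")
                .long("url")
                .value_name("URL")
                .takes_value(true),
        )
        .arg(
            Arg::with_name("url_set")
                .long("url-set")
                .value_name("URLS")
                .takes_value(true)
                .number_of_values(3) // exactly three endpoints
                .conflicts_with("json_rpc_url"),
        )
}
```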

@nk4rter

nk4rter commented Jan 28, 2025

@t-nelson I like the idea of just allowing --url to accept multiple arguments. And considering that there's little benefit from requiring too many nodes to agree, we can just always expect num_rpc_urls / 2 + 1 to confirm health. This approach doesn't require any additional arguments and seems very simple from the user standpoint. Also restricting the maximum number of endpoints wouldn't really simplify the logic of watchtower anyway...
Would allowing 2 endpoints still be bad in this case? They would both have to agree. Maybe it would be easier to just clearly state in the docs that 1 or 3 is advised but technically allow 2, because this would allow for a simpler user interface?
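
For reference, that implicit threshold is just a strict majority of the configured endpoints (illustrative helper, not existing code):

```rust
// Strict majority of the configured endpoints must agree; no extra CLI argument.
fn min_agreeing_endpoints(num_rpc_urls: usize) -> usize {
    num_rpc_urls / 2 + 1 // 1 -> 1, 2 -> 2, 3 -> 2, 5 -> 3
}
```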

@t-nelson

"never go to sea with two chronometers. take one, or three"

@nk4rter

nk4rter commented Jan 30, 2025

Shouldn't this be a >=?
https://github.com/anza-xyz/agave/blob/master/watchtower/src/main.rs#L386
Or is this intentional?
The default value of unhealthy_threshold is 1, so currently we always skip the first failure.

@t-nelson

i think the logic is correct as is. threshold zero (no appetite) should alert the first one
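
A hedged reading of the check being discussed (not a verbatim copy of main.rs), showing why `>` gives that behavior:

```rust
// Assumes the consecutive-failure counter is incremented before the check.
fn should_notify(num_consecutive_failures: usize, unhealthy_threshold: usize) -> bool {
    num_consecutive_failures > unhealthy_threshold
}

fn main() {
    assert!(should_notify(1, 0));  // threshold 0: the very first failure alerts
    assert!(!should_notify(1, 1)); // default threshold 1: the first failure is skipped
    assert!(should_notify(2, 1));  // the second consecutive failure alerts
}
```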

@nk4rter

nk4rter commented Feb 1, 2025

It looks like I'm almost done with the implementation...
A couple of questions:

  1. Let's say we have 3 endpoints: 1 is unreachable, 1 says the transaction count is not advancing, 1 says the validator is delinquent. Do we report both alerts as separate datapoint_error!s, or do we report a single datapoint_error! and concatenate the messages?
  2. Again 3 endpoints: 2 are unreachable, 1 says the validator is delinquent. Do we report the datapoint_error! for the watchtower-reliability alert and the delinquency alert, or do we only report the watchtower-reliability?
  3. If we got different failure reports from several nodes, but they are skipped by config.unhealthy_threshold, do we still need to write every message in the info!("Failure {} of {}: ...") log? It would be convenient not to do that, and all the failures are logged as error! anyway.
  4. Should we report datapoint_info!("watchtower-sanity", ("ok", true, bool)); here as well? https://github.com/anza-xyz/agave/blob/master/watchtower/src/main.rs#L398
    Because now, when we filter the failures by config.unhealthy_threshold, we stop saying that ok is true as well, even though we don't provide any datapoint_error!s yet.

@t-nelson

t-nelson commented Feb 4, 2025

It looks like I'm almost done with the implementation... A couple of questions:

1. Let's say we have 3 endpoints: 1 is unreachable, 1 says transaction count is not advancing, 1 says that validator is delinquent. Do we report both alerts as separate `datapoint_error!` or, do we report a single `datapoint_error!` and concatenate the messages or something?

i think just raise watchtower reliability alert. we can't make any sense of the information we've collected

2. Again 3 endpoints: 2 are unreachable, 1 says the validator is delinquent. Do we report the `datapoint_error!` for the `watchtower-reliability` alert _and_ the delinquency alert, or do we only report the `watchtower-reliability`?

just reliability

3. If we got different failure reports from several nodes, but they are skipped by `config.unhealthy_threshold`, do we still need to write every message in the `info!("Failure {} of {}: ...")` log? It would be convenient not to do that, and all the failures are logged as `error!` anyway.

i think having the log messages will be useful when a watchtower reliability alert is raised

4. Should we report `datapoint_info!("watchtower-sanity", ("ok", true, bool));` here as well? https://github.com/anza-xyz/agave/blob/master/watchtower/src/main.rs#L398
   Because now when we filter the failures by the `config.unhealthy_threshold`, we stop saying that ok is true as well, even though we don't provide any `datapoint_error!`s yet.

i think you're thinking about this backwards. we want threshold endpoints to agree that our node and the cluster are healthy. we raise an alert any time that's not the case. attempting to get agreement on the failure mode will be very difficult to get right
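
In other words, count healthy agreement rather than failures (a sketch of that rule; the `RpcReport` type and names are illustrative):

```rust
// Illustrative types/names; not existing Watchtower code.
enum RpcReport {
    Healthy,
    Unhealthy(String), // delinquency, stalled transaction count, etc.
    Unreachable,       // network / HTTP error
}

/// Raise an alert whenever fewer than `min_agreeing_endpoints` endpoints
/// agree that the node and cluster are healthy, regardless of why the
/// rest disagree or failed to answer.
fn needs_alert(reports: &[RpcReport], min_agreeing_endpoints: usize) -> bool {
    let healthy = reports
        .iter()
        .filter(|r| matches!(r, RpcReport::Healthy))
        .count();
    healthy < min_agreeing_endpoints
}
```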

@nk4rter

nk4rter commented Feb 4, 2025

@t-nelson
Okay, I'll only raise the unreliability alert in cases 1 & 2.
In case 3, because of cases 1 & 2, it is guaranteed that we only have 1 failure message, but the other errors that were replaced by the unreliability alert will still be present in the logs.
Regarding question 4, I meant something different.
So we essentially have 3 branches here:

  1. >=min_agreeing_endpoints endpoints say that the cluster is healthy
  2. <min_agreeing_endpoints endpoints say that the cluster is healthy, but the alert is skipped because config.unhealthy_threshold (which says how many consecutive failures must occur to trigger a notification) has not been exceeded yet.
  3. <min_agreeing_endpoints endpoints say that the cluster is healthy, and the alert is not skipped, because the failure has occurred more than config.unhealthy_threshold times.
    What seemed weird to me is that in branch 1 we add a datapoint with ok=true, and in branch 3 we add a datapoint with ok=false, but in branch 2 we don't add any datapoint at all. So from the point of view of an entity looking at these datapoints, it would appear that watchtower stopped sending anything for however many iterations branch 2 is active. It seems more logical to keep adding the ok=true datapoint until we hit branch 3.
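
A sketch of those three branches, assuming the solana-metrics macros quoted above; the `report` helper, its parameters, and the error-datapoint fields are illustrative:

```rust
use solana_metrics::{datapoint_error, datapoint_info};

// Illustrative sketch of the three branches; not the actual main.rs code.
fn report(
    healthy_endpoints: usize,
    min_agreeing_endpoints: usize,
    num_consecutive_failures: usize,
    unhealthy_threshold: usize,
    failure_message: &str,
) {
    if healthy_endpoints >= min_agreeing_endpoints {
        // Branch 1: enough endpoints agree that we're healthy.
        datapoint_info!("watchtower-sanity", ("ok", true, bool));
    } else if num_consecutive_failures <= unhealthy_threshold {
        // Branch 2: unhealthy, but still within the grace period.
        // The suggestion above: keep emitting ok=true here so the
        // datapoint stream has no gaps.
        datapoint_info!("watchtower-sanity", ("ok", true, bool));
    } else {
        // Branch 3: unhealthy past the threshold; raise the alert.
        datapoint_info!("watchtower-sanity", ("ok", false, bool));
        datapoint_error!("watchtower-sanity-failure", ("err", failure_message, String));
    }
}
```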

@nk4rter

nk4rter commented Feb 4, 2025

I updated the pull request accordingly and moved it out of the draft state, although the CI still needs to be approved :)
