
Groundwork for keepalives for all services #368

Merged: 8 commits merged into martomi:main on Feb 18, 2023

Conversation

jinnatar (Contributor) commented:

This PR deprecates enabling or disabling individual handlers in favor of service-level enablement, which in turn enables all handlers defined for that service.

More groundwork for fixing #361. As of this PR it's already possible to disable harvester monitoring, but there are no other keepalives available yet. The next PR in the chain adds a new keepalive for the wallet as an example.

I'm torn on which services should be monitored by default. The conservative default would be no services, but that clashes with the previous default of running all handlers and the harvester keepalive. So I've opted to default to monitoring all services and thus all handlers, which also enables the harvester keepalive. Where it gets a bit wonky is when we bring in a wallet keepalive: there's been no harm in having the incoming transaction handler enabled by default, but by comparison the peak handler expects to find a working wallet. If one isn't found, it will start triggering keepalive alerts about missing wallet activity.
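For readers skimming this, a rough sketch of what the service-level config discussed here could look like. The monitored_services key name comes from the review discussion below; the service names and layout are illustrative rather than the exact contents of src/default_config.yaml:

    # Illustrative sketch only - see src/default_config.yaml / config-example.yaml
    # for the authoritative layout. Service names here are examples.
    monitored_services:
      - harvester   # default: all services monitored, matching the old behaviour
      - wallet      # enabling a service enables all handlers defined for it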

jinnatar (Contributor, Author) commented:

The sneak peek of the wallet peak handler and wallet keepalive:
https://github.com/Artanicus/chiadog/compare/keepalive...Artanicus:chiadog:keepalive_wallet?expand=1

martomi (Owner) left a comment:


Agreed with the default approach here. A few other comments / thoughts - PTAL.

Resolved review threads: src/default_config.yaml, src/chia_log/log_handler.py, src/notifier/keep_alive_monitor.py (×3)
message = (
    f"Your {service.name} appears to be offline! "  # previously: f"Your harvester appears to be offline! "
    f"No events for the past {seconds_since_last} seconds."
)
martomi (Owner) commented:


Here we should probably add some helpful hints, e.g.

No events for the past 320 seconds. You can adjust your config.yaml to either increase the notify threshold from {current threshold}, or disable the checks for this service by removing {service.name} from the list of monitored_services.

martomi (Owner) commented:


But maybe this verbosity should only be included in the logging.warning below and not the actual notification.

martomi (Owner) commented:


I think it doesn’t hurt to include it in both.

jinnatar (Contributor, Author) commented:


I've also been thinking that "offline" might not be the best term here. We could instead phrase it as "not healthy", since "offline" has a strong connotation that the service isn't running, when in reality it's usually sort of online, just wedged somehow. :D

If it's rephrased as healthiness, then the messaging could be something like "Your {service.name} is unhealthy and hasn't behaved as expected for the past {seconds_since_last} seconds. (This check can be adjusted.)", which hopefully would lead anyone who is annoyed by it to go read the documentation on how to stop monitoring that service, or to adjust the thresholds.
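As a concrete, purely illustrative sketch combining the "unhealthy" phrasing above with the config hint suggested earlier in the thread; the helper name and exact wording are not what was merged:

    # Illustrative only - not the exact wording or structure merged in this PR.
    def build_unhealthy_message(service_name: str, seconds_since_last: int, threshold: int) -> str:
        return (
            f"Your {service_name} is unhealthy and hasn't behaved as expected "
            f"for the past {seconds_since_last} seconds. You can increase the "
            f"notify threshold from {threshold} seconds in config.yaml, or stop "
            f"monitoring this service by removing {service_name} from monitored_services."
        )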

jinnatar (Contributor, Author) commented:


Updated wording, LMK if that looks good.

Resolved review thread: src/default_config.yaml
jinnatar added a commit to jinnatar/chiadog that referenced this pull request Feb 15, 2023
This is temporarily needed until the keepalive changes in martomi#368 go through and make the keepalive system config aware.

Fixes martomi#369.
martomi pushed a commit that referenced this pull request Feb 15, 2023
(Same commit message as above.)
Commit messages from the pull request timeline:

  • This commit deprecates enabling or disabling individual handlers in favor of service-level enablement.
  • No change to the resulting image, just better caching of requirements.
  • Now monitored services are read from the global config on init.
  • Also drop handler sections that don't have any options; handler.enabled options are obsolete and don't do anything.
  • Also trigger the alert for the first miss of the lowest threshold. This means that a service already down when starting will be notified at T+threshold instead of T+threshold*2 (see the sketch below).
  • Since monitoring is now configured at the service level and we're about to have many more handlers for keepalives, this keeps INFO-level logging relevant and easier to understand.
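A minimal sketch of what "trigger the alert on the first miss of the lowest threshold" means in practice; the function below is illustrative and not the actual structure of src/notifier/keep_alive_monitor.py:

    # Illustrative sketch only; the real logic lives in keep_alive_monitor.py.
    import time

    def check_keepalive(last_event_ts: float, lowest_threshold: int, notify) -> None:
        seconds_since_last = time.time() - last_event_ts
        # Checks run roughly every `lowest_threshold` seconds. Alerting on the
        # very first check that exceeds the threshold means a service that is
        # already down at startup gets reported at about T+threshold,
        # not T+threshold*2.
        if seconds_since_last >= lowest_threshold:
            notify(f"No events for the past {int(seconds_since_last)} seconds.")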
martomi (Owner) left a comment:


Looks good, just a couple of seemingly duplicate lines. Also, I’d like us to increase test coverage of the keep-alive monitor now that we’re doing more work on it. Can also be a follow-up PR.

Resolved review thread: Dockerfile
# These thresholds determine how long a service can be unhealthy,
# before we trigger a high priority alert.
# The lowest value here determines how often the keep-alive checks run,
# and thus how often a notification for an unhealthy service will trigger!
martomi (Owner) commented:


Not for this PR: In future we might wanna decouple the frequency of notifications from the lowest value.

jinnatar (Contributor, Author) commented:


Yeah, that makes sense. Have a default and a setting, not too hard to implement after the config revamp.
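For illustration, the threshold section quoted above might end up shaped roughly like this once per-service keepalives land; the key names and values are guesses for the sake of example, not the actual config-example.yaml contents:

    # Purely illustrative - consult config-example.yaml for the real keys.
    keep_alive_monitor:
      # How long each service may stay silent before a high-priority alert.
      # The lowest value also sets how often the keep-alive checks run.
      notify_threshold_seconds:
        harvester: 300
        wallet: 600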

Resolved review threads: config-example.yaml, src/notifier/keep_alive_monitor.py (×2)
jinnatar (Contributor, Author) commented:

Current code coverage overall with most notifiers enabled (8 tests skipped and Grafana failing) is 78%. Not bad, not great. Keep-alive is at 87%. Biggest misses are:

  • Remote keep-alive isn't tested. I've previously used mbtest for these kinds of things, but it's a really heavy testing dependency, so I'll have a go at using VCR.py since it's way more self-contained and faster (a rough sketch follows below).
  • We don't test what happens without a NotifyManager, probably should.
  • A couple of error conditions are not triggered, probably better to just exclude those from coverage.

Then there's actually testing novel situations; I'll have a think about those while getting the pure coverage up first.
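A rough sketch of the VCR.py approach mentioned above, assuming the remote keep-alive ultimately boils down to an HTTP request to a configured ping URL; the helper name, URL, and cassette path are placeholders, not chiadog's actual API:

    # Hypothetical test sketch - not chiadog's real code paths or names.
    import requests
    import vcr

    def send_remote_ping(url: str) -> int:
        """Stand-in for the remote keep-alive ping; swap in the real code path."""
        return requests.get(url, timeout=10).status_code

    @vcr.use_cassette("tests/cassettes/remote_keep_alive.yaml")
    def test_remote_ping_records_and_replays():
        # The first run records the HTTP interaction into the cassette;
        # later runs replay it, so the test stays fast and self-contained.
        assert send_remote_ping("https://example.com/ping") == 200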

martomi (Owner) left a comment:


🚀

martomi merged commit 58c6fdf into martomi:main on Feb 18, 2023.