
Groundwork for keepalives for all services #368

Merged: 8 commits merged into martomi:main on Feb 18, 2023

Conversation

jinnatar (Contributor) commented:

This PR deprecates enabling or disabling individual handlers in favor of service-level enablement, which in turn enables all handlers defined for that service.

More groundwork for fixing #361. As of this PR it's already possible to disable harvester monitoring, but there are no other keepalives available yet. The next PR in the chain adds a new keepalive for the wallet as an example.

I'm torn on which services should be monitored by default. The conservative default would be no services, but that clashes with the previous default of running all handlers and the harvester keepalive. So I've opted to default to monitoring all services and thus all handlers, which also enables the harvester keepalive. Where it gets a bit wonky is when we bring in a wallet keepalive: there's been no harm in having the incoming transaction handler enabled by default, but by comparison the peak handler expects to find a working wallet. If one isn't found, it will start triggering keepalive alerts about missing wallet activity.
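For readers skimming this, a rough sketch of what the service-level config discussed here could look like. The monitored_services key name comes from the review discussion below; the service names and layout are illustrative rather than the exact contents of src/default_config.yaml:

    # Illustrative sketch only - see src/default_config.yaml / config-example.yaml
    # for the authoritative layout. Service names here are examples.
    monitored_services:
      - harvester   # default: all services monitored, matching the old behaviour
      - wallet      # enabling a service enables all handlers defined for it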

jinnatar (Contributor, Author) commented:

The sneak peek of the wallet peak handler and wallet keepalive:
https://github.com/Artanicus/chiadog/compare/keepalive...Artanicus:chiadog:keepalive_wallet?expand=1

martomi (Owner) left a comment:


Agreed with the default approach here. A few other comments / thoughts - PTAL.

Resolved review threads: src/default_config.yaml, src/chia_log/log_handler.py, src/notifier/keep_alive_monitor.py (×3)
message = (
    f"Your {service.name} appears to be offline! "  # previously: f"Your harvester appears to be offline! "
    f"No events for the past {seconds_since_last} seconds."
)
martomi (Owner) commented:


Here we should probably add some helpful hints, e.g.

No events for the past 320 seconds. You can adjust your config.yaml to either increase the notify threshold from {current threshold}, or disable the checks for this service by removing {service.name} from the list of monitored_services.

martomi (Owner) commented:


But maybe this verbosity should only be included in the logging.warning below and not the actual notification.

martomi (Owner) commented:


I think it doesn’t hurt to include it in both.

jinnatar (Contributor, Author) commented:


I've also been thinking that "offline" might not be the best term here. We could instead phrase it as "not healthy", since "offline" has a strong connotation that the service isn't running, when in reality it's usually sort of online, just wedged somehow. :D

If it's rephrased as healthiness, then the messaging could be something like "Your {service.name} is unhealthy and hasn't behaved as expected for the past {seconds_since_last} seconds. (This check can be adjusted.)", which hopefully would lead anyone who is annoyed by it to go read the documentation on how to stop monitoring that service, or to adjust the thresholds.
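As a concrete, purely illustrative sketch combining the "unhealthy" phrasing above with the config hint suggested earlier in the thread; the helper name and exact wording are not what was merged:

    # Illustrative only - not the exact wording or structure merged in this PR.
    def build_unhealthy_message(service_name: str, seconds_since_last: int, threshold: int) -> str:
        return (
            f"Your {service_name} is unhealthy and hasn't behaved as expected "
            f"for the past {seconds_since_last} seconds. You can increase the "
            f"notify threshold from {threshold} seconds in config.yaml, or stop "
            f"monitoring this service by removing {service_name} from monitored_services."
        )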

jinnatar (Contributor, Author) commented:


Updated wording, LMK if that looks good.

Resolved review thread: src/default_config.yaml
jinnatar added a commit to jinnatar/chiadog that referenced this pull request Feb 15, 2023
This is temporarily needed until the keepalive changes in martomi#368 go through and make the keepalive system config aware.

Fixes martomi#369.
martomi pushed a commit that referenced this pull request Feb 15, 2023
(Same commit message as above.)
Commit messages from the pull request timeline:

  • This commit deprecates enabling or disabling individual handlers in favor of service-level enablement.
  • No change to the resulting image, just better caching of requirements.
  • Now monitored services are read from the global config on init.
  • Also drop handler sections that don't have any options; handler.enabled options are obsolete and don't do anything.
  • Also trigger the alert for the first miss of the lowest threshold. This means that a service already down when starting will be notified at T+threshold instead of T+threshold*2 (see the sketch below).
  • Since monitoring is now configured at the service level and we're about to have many more handlers for keepalives, this keeps INFO-level logging relevant and easier to understand.
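A minimal sketch of what "trigger the alert on the first miss of the lowest threshold" means in practice; the function below is illustrative and not the actual structure of src/notifier/keep_alive_monitor.py:

    # Illustrative sketch only; the real logic lives in keep_alive_monitor.py.
    import time

    def check_keepalive(last_event_ts: float, lowest_threshold: int, notify) -> None:
        seconds_since_last = time.time() - last_event_ts
        # Checks run roughly every `lowest_threshold` seconds. Alerting on the
        # very first check that exceeds the threshold means a service that is
        # already down at startup gets reported at about T+threshold,
        # not T+threshold*2.
        if seconds_since_last >= lowest_threshold:
            notify(f"No events for the past {int(seconds_since_last)} seconds.")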
martomi (Owner) left a comment:


Looks good, just a couple of seemingly duplicate lines. Also, I’d like us to increase test coverage of the keep-alive monitor now that we’re doing more work on it. Can also be a follow-up PR.

Resolved review thread: Dockerfile
# These thresholds determine how long a service can be unhealthy,
# before we trigger a high priority alert.
# The lowest value here determines how often the keep-alive checks run,
# and thus how often a notification for an unhealthy service will trigger!
martomi (Owner) commented:


Not for this PR: In future we might wanna decouple the frequency of notifications from the lowest value.

jinnatar (Contributor, Author) commented:


Yeah, that makes sense. Have a default and a setting, not too hard to implement after the config revamp.
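For illustration, the threshold section quoted above might end up shaped roughly like this once per-service keepalives land; the key names and values are guesses for the sake of example, not the actual config-example.yaml contents:

    # Purely illustrative - consult config-example.yaml for the real keys.
    keep_alive_monitor:
      # How long each service may stay silent before a high-priority alert.
      # The lowest value also sets how often the keep-alive checks run.
      notify_threshold_seconds:
        harvester: 300
        wallet: 600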

Resolved review threads: config-example.yaml, src/notifier/keep_alive_monitor.py (×2)
jinnatar (Contributor, Author) commented:

Current code coverage overall with most notifiers enabled (8 tests skipped and Grafana failing) is 78%. Not bad, not great. Keep-alive is at 87%. Biggest misses are:

  • Remote keep-alive isn't tested. I've previously used mbtest for these kinds of things, but it's a really heavy testing dependency, so I'll have a go at using VCR.py since it's way more self-contained and faster (a rough sketch follows below).
  • We don't test what happens without a NotifyManager, probably should.
  • A couple of error conditions are not triggered, probably better to just exclude those from coverage.

Then there's actually testing novel situations; I'll have a think about those while getting the pure coverage up first.
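A rough sketch of the VCR.py approach mentioned above, assuming the remote keep-alive ultimately boils down to an HTTP request to a configured ping URL; the helper name, URL, and cassette path are placeholders, not chiadog's actual API:

    # Hypothetical test sketch - not chiadog's real code paths or names.
    import requests
    import vcr

    def send_remote_ping(url: str) -> int:
        """Stand-in for the remote keep-alive ping; swap in the real code path."""
        return requests.get(url, timeout=10).status_code

    @vcr.use_cassette("tests/cassettes/remote_keep_alive.yaml")
    def test_remote_ping_records_and_replays():
        # The first run records the HTTP interaction into the cassette;
        # later runs replay it, so the test stays fast and self-contained.
        assert send_remote_ping("https://example.com/ping") == 200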

martomi (Owner) left a comment:


🚀

martomi merged commit 58c6fdf into martomi:main on Feb 18, 2023.