Groundwork for keepalives for all services #368
Conversation
A sneak peek of the wallet peak handler and wallet keepalive:
Agreed with the default approach here. Few other comments / thoughts - PTAL.
src/notifier/keep_alive_monitor.py (Outdated)
```diff
 message = (
-    f"Your harvester appears to be offline! "
+    f"Your {service.name} appears to be offline! "
     f"No events for the past {seconds_since_last} seconds."
```
Here we should probably add some helpful hints, e.g.

> No events for the past 320 seconds. You can adjust your config.yaml to either increase the notify threshold from {current threshold}, or disable the checks for this service by removing {service.name} from the list of `monitored_services`.
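A minimal sketch of the suggested message with hints, assuming illustrative names (`service_name`, `seconds_since_last`, `threshold_seconds`) rather than the PR's actual code:

```python
def build_offline_message(service_name: str, seconds_since_last: int, threshold_seconds: int) -> str:
    """Hypothetical helper: offline alert plus config hints, per the suggestion above."""
    return (
        f"Your {service_name} appears to be offline! "
        f"No events for the past {seconds_since_last} seconds. "
        f"You can adjust your config.yaml to either increase the notify "
        f"threshold from {threshold_seconds}, or disable the checks for this "
        f"service by removing {service_name} from the list of monitored_services."
    )
```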
But maybe this verbosity should only be included in the logging.warning
below and not the actual notification.
I think it doesn't hurt to include it in both.
I've also been thinking that "offline" might not be the best term here. We could instead phrase it as "not healthy", since offline has a strong connotation that the service isn't running, when in reality, in most cases, it's sort of online, just wedged somehow. :D
If it's rephrased as healthiness, then the messaging could be something like Your {service.name} is unhealthy and hasn't behaved as expected for the past {seconds_since_last} seconds. (This check can be adjusted.)
.. which hopefully would lead anyone that is annoyed by it to go read documentation on how to not monitor that service, or to adjust the thresholds.
Updated wording, LMK if that looks good.
This is temporarily needed until the keepalive changes in martomi#368 go through and make the keepalive system config-aware. Fixes martomi#369.
This commit deprecates enabling or disabling individual handlers in favor of service level enablement.
No change to the resulting image, just better caching of requirements
Now monitored services are read from the global config on init
Also drop handler sections that don't have any options. handler.enabled options are obsolete and don't do anything.
Also trigger an alert for the first miss of the lowest threshold. This means a service that is already down at startup will be notified at T+threshold instead of T+threshold*2.
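A hedged sketch of the behaviour described in that commit message, assuming illustrative names (`last_events`, `thresholds`) rather than the PR's actual code: the check fires on the first tick at which the threshold is reached, so a service down at startup alerts at T+threshold.

```python
def check_services(last_events: dict, thresholds: dict, now: float) -> list:
    """Return the services whose last event is at least `threshold` seconds old.

    Using >= (not >) means the very first check that reaches the threshold
    already triggers an alert, rather than waiting for the next interval.
    """
    alerts = []
    for service, last_seen in last_events.items():
        if now - last_seen >= thresholds[service]:
            alerts.append(service)
    return alerts
```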
Since monitoring is now configured at the service level and we're about to have many more handlers for keepalives, this keeps INFO level logging relevant and easier to understand.
Looks good, just a couple of seemingly duplicate lines. Also I'd like us to increase test coverage of the keep-alive monitor now that we're doing more work on it. Can also be a followup PR.
```python
# These thresholds determine how long a service can be unhealthy,
# before we trigger a high priority alert.
# The lowest value here determines how often the keep-alive checks run,
# and thus how often a notification for an unhealthy service will trigger!
```
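For illustration, the per-service thresholds being discussed might look like the following config.yaml fragment; the key names here are assumptions based on the thread, not the project's actual schema:

```yaml
keep_alive_monitor:
  # The lowest value also sets the check (and re-notification) frequency.
  notify_threshold_seconds:
    HARVESTER: 300
    WALLET: 600
```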
Not for this PR: In future we might wanna decouple the frequency of notifications from the lowest value.
Yeah, that makes sense. Have a default and a setting, not too hard to implement after the config revamp.
Current code coverage overall with most notifiers enabled (8 tests skipped and Grafana failing) is 78%. Not bad, not great. Keep-alive is at 87%. Biggest misses are:
Then there's actually testing novel situations; I'll have a think about those while getting the pure coverage up first.
🚀
This PR deprecates enabling or disabling individual handlers in favor of service level enablement, which in turn enables all handlers defined for that service.
More groundwork for fixing #361. As of this PR it's already possible to disable harvester monitoring, but there are no other keepalives available yet. The next PR in the chain adds a new keepalive for the wallet as an example.
I'm torn on which services should be monitored by default. The conservative default would be no services, but that clashes with the previous default of running all handlers and the harvester keepalive. So I've opted to default to monitoring all services and thus all handlers, which also enables the harvester keepalive. Where it gets a bit wonky is when we bring in a wallet keepalive. There's been no harm in having the incoming transaction handler enabled by default, but by comparison the peak handler expects to find a working wallet. If one isn't found, it will start triggering keepalive alerts about no wallet activity.
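The chosen default could be sketched as follows; `ALL_SERVICES` and the `monitored_services` key are illustrative names taken from the discussion above, not necessarily the PR's real identifiers:

```python
# Hypothetical set of known service names; the real project may differ.
ALL_SERVICES = ["harvester", "farmer", "full_node", "wallet"]

def resolve_monitored_services(config: dict) -> list:
    """Default to monitoring all services (and thus enabling all handlers)
    when the config omits monitored_services, matching the previous
    behaviour of running every handler plus the harvester keepalive."""
    return config.get("monitored_services", ALL_SERVICES)
```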