
[Bug]: Unreachable alerts during upgrades #858

Open
hugovalente-pm opened this issue Jun 22, 2023 · 14 comments

Comments

@hugovalente-pm
Contributor

hugovalente-pm commented Jun 22, 2023

Bug description

A user has reported on Discord (thread) that:

I get frequently false unreachable alerts about servers which are not offline.
After 15 minutes the alert is gone without any intervention. It happens once a day and sometimes there are several days between those alerts. Anyone else with the same problem?

Seems to be related to auto-updates

most of the alerts are at the same times each day, e.g. 7:00, 15:00, ...

Expected behavior

Unreachable alerts should only fire after a delay that accounts for the typical time an agent needs to update and restart.
The current delay seems to be set to 30 seconds (TBC).
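
For illustration only, a minimal sketch of how such a grace period could work on the Cloud side (this is not the actual Cloud implementation; the names and the 90-second value are placeholders taken from the discussion below):

```go
package main

import (
	"fmt"
	"time"
)

// How long Cloud waits after losing an agent connection before firing an
// "unreachable" alert. Placeholder value; 90s is the figure suggested below.
const unreachableGracePeriod = 90 * time.Second

// maybeAlertUnreachable is called when an agent disconnects. It alerts only
// if the agent has not reconnected within the grace period.
func maybeAlertUnreachable(nodeID string, reconnected <-chan struct{}) {
	select {
	case <-reconnected:
		fmt.Printf("node %s reconnected in time, no alert\n", nodeID)
	case <-time.After(unreachableGracePeriod):
		fmt.Printf("node %s unreachable for %s, firing alert\n", nodeID, unreachableGracePeriod)
	}
}

func main() {
	reconnected := make(chan struct{})
	go func() {
		// Simulate an agent coming back quickly, e.g. after an auto-update restart.
		time.Sleep(10 * time.Second)
		close(reconnected)
	}()
	maybeAlertUnreachable("node-1", reconnected)
}
```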

Steps to reproduce

Screenshots

No response

Error Logs

No response

Desktop

No response

Additional context

No response

@hugovalente-pm hugovalente-pm added the "bug" (Something isn't working) and "needs triage" labels on Jun 22, 2023
@hugovalente-pm
Contributor Author

@netdata/cloud-be we will need to review this delay, but we need to agree on a suitable value; @ilyam8 has suggested increasing it to 90 seconds.
cc/ @ralphm

@luisj1983

Please see if you can make it a configurable delay as some platforms can take a lot longer than others to get things done.
Ideally a v2 iteration would be to hook into some sort of netdata agent health status so that we know when the agent really is back and ready to go :-)

@ralphm
Member

ralphm commented Jun 22, 2023

Unfortunately, an Agent's startup time is correlated with the number of nodes and the amount of retained data. I know that the Agent team has been working hard on reducing this, and I think a delay is a reasonable short-term solution.

A possible future approach is if the Agent could actively let Cloud know that it is going down for a restart or explicit shutdown. This allows Cloud to distinguish this from unexpected disconnects and have different notification behavior.

Also, ephemeral nodes should probably not yield notifications in most cases.
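
A rough sketch of the kind of signal this could be (illustrative only; this message type and these field names are not an existing Agent/Cloud API):

```go
// Illustrative only: a hypothetical "going away" message an agent could send
// to Cloud before an intentional restart, so Cloud can suppress unreachable
// alerts for the expected downtime instead of treating it as a failure.
package protocol

import "time"

type DisconnectReason string

const (
	ReasonUpgrade  DisconnectReason = "upgrade"
	ReasonRestart  DisconnectReason = "restart"
	ReasonShutdown DisconnectReason = "shutdown"
)

type GoingAway struct {
	NodeID           string           `json:"node_id"`
	Reason           DisconnectReason `json:"reason"`
	ExpectedDowntime time.Duration    `json:"expected_downtime"` // 0 = not coming back
	SentAt           time.Time        `json:"sent_at"`
}
```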

@ilyam8
Member

ilyam8 commented Jun 22, 2023

Yes, and it is also correlated with system resources (e.g. slow storage can significantly delay parent instances with a lot of historical data).

@hugovalente-pm
Contributor Author

OK, we probably need to tackle this in two steps:

  1. More immediate solution: add an unreachable timeout configuration per Space (?)
  2. Have proper messaging from Agent to Cloud saying that it is going down for a restart

To try to assess the urgency of this: I don't think I've seen it reported very often - on Discord it was @luisj1983 and one other user.

@luisj1983

@hugovalente-pm People probably aren't bothering to report it, IMO.
After all, you can reproduce the behaviour any time you upgrade an agent, so it must be happening.

Issues

I think that there are two distinct but related issues here.

  1. The agent-reachability issue
    This occurs when the Netdata Cloud thinks that the agent/node is unreachable.
    This is the issue reported by the chap on Discord.
  2. The noisy-alarms issue
This occurs when some foreseeable action on the agent produces a flurry of spurious alarms. Foreseeable scenarios would be things like agent upgrades and, in future, any maintenance actions the agent takes (e.g. db housekeeping) which are reported to cause unnecessary alerts.

I think that the agent-reachability issue is a sub-issue of the noisy-alarms issue, because both are predicated on the way Netdata Cloud or the agent handles alerts related to agent actions (I include packaging in that, fair or not).

Priority

I'd say this is important and non-urgent.
That's because it's not exactly breaking anything (although the unnecessary nature of the alerts-noise could be debated) and thus non-urgent; but important precisely because it introduces noise into the alerts.
It's also non-urgent because there are workarounds such as maintenance windows, silencing alarms etc.

Netdata works well to make infra and apps more visible with less noise; having activities of the agent contribute to the noise contravenes that axiom.

We are generating alerts which are completely foreseeable and avoidable in the scenario I'm talking about because they are generated by the management of the agent itself (whose packaging Netdata controls).

Now, of course, that's why in Ops we have things like maintenance windows and alarm profiles etc but none of that is very dynamic. I see this as a good way to differentiate Netdata from other solutions too and would encourage customers to keep up-to-date.

Workarounds

I was looking briefly at dpkg hooks to see if I could make some changes there on my test system, which could then become a documented workaround.
I'm also going to start work in the next 1-2 weeks on an Ansible role to handle maintenance scenarios.
However, the problem as I see it is that, as far as I know, the Netdata agent has no queryable concept of being ready. So the problem is the result of having to use arbitrary tokens of readiness...

  • For example, the agent itself implicitly treats service start as readiness: as soon as the service is up, anything amiss will start generating alerts almost immediately.
  • Alternatively, we can talk about having cool-down timers, as already suggested.
    The problem with that is that there is no correct amount of time to wait. It's a reasonable workaround, but given the amount of metrics the Netdata agent collects about its own internals, I think the gold standard would be to treat the Netdata-internal metrics as one collection and all the other metrics as another, and somehow make alerting on the latter dependent upon the former. A sort of agent-health concept (a rough sketch of what I mean follows below).
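
To make the "arbitrary token of readiness" point concrete, here is a rough sketch that simply polls the agent's local API and treats the first successful response as "up" - exactly the kind of weak readiness signal I mean. The endpoint, port and timings are assumptions for illustration, not a documented readiness contract:

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// waitForAgent polls the local agent API until it responds, or gives up.
// A 200 from /api/v1/info only proves the web server is answering; collectors
// and health checks may still be warming up, which is why this is a weak signal.
func waitForAgent(baseURL string, timeout time.Duration) error {
	client := &http.Client{Timeout: 2 * time.Second}
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		resp, err := client.Get(baseURL + "/api/v1/info")
		if err == nil {
			resp.Body.Close()
			if resp.StatusCode == http.StatusOK {
				return nil
			}
		}
		time.Sleep(2 * time.Second)
	}
	return fmt.Errorf("agent at %s not responding after %s", baseURL, timeout)
}

func main() {
	if err := waitForAgent("http://127.0.0.1:19999", 90*time.Second); err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("agent is answering API requests (but may not be fully ready)")
}
```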

Sorry for the long reply :-)

@ilyam8
Member

ilyam8 commented Jun 27, 2023

@hugovalente-pm I still think that a good step 0, until 1, 2, etc. are discussed/implemented, is increasing the timeout.

@hugovalente-pm
Contributor Author

@luisj1983 thanks for the detailed comment. I agree this is an important fix but not urgent (nothing is really breaking); still, we certainly shouldn't spam users with alerts that are triggered by agent updates.

The best solution really seems to be 2., ensuring that Cloud and Agent agree on when an agent is supposed to go down.

If nobody opposes, we can increase the timeout to 90 seconds, as you suggested, @ilyam8.
@car12o @ralphm any concerns?

@car12o

car12o commented Jun 27, 2023

@hugovalente-pm I'm OK with the change, although we need to bear in mind it will delay all kinds of reachability notifications, even the ones some users may want to be paged about asap.

@ilyam8
Member

ilyam8 commented Jun 27, 2023

We understand that increasing the timeout to 90 seconds will increase the delay of reachability notifications.

@luisj1983

luisj1983 commented Jun 27, 2023

I'm fine with a delay, since I'd rather get alerts that are meaningful.
If this delay is added to agent startup then it's potentially quite useful too, since we know what happens when you restart a server: you get lots of alerts because things may still be spinning up.
What I would say is that I definitely wouldn't want a delay to the monitoring itself, as it's crucial to have data, especially on startup.

One thing to note is that I'd strongly recommend that this is not a default but a timeout configurable in the netdata.conf.

@hugovalente-pm
Contributor Author

One thing to note is that I'd strongly recommend that this is not a default but a timeout configurable in the netdata.conf.

This is something that needs to be controlled from Cloud, since it is Cloud that identifies the unreachable status, so it would need to be a per-Space setting - which is more effort than changing our current setting.
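
To illustrate what a per-Space setting could look like on the Cloud side (purely a sketch; these types and defaults are not the actual Cloud data model):

```go
// Sketch only: a hypothetical per-Space override for the unreachable-alert
// grace period, falling back to a global default (90s, as discussed above).
package settings

import "time"

const defaultUnreachableTimeout = 90 * time.Second

type SpaceNotificationSettings struct {
	SpaceID            string
	UnreachableTimeout time.Duration // 0 means "use the global default"
}

// EffectiveUnreachableTimeout returns the grace period to apply for a Space.
func (s SpaceNotificationSettings) EffectiveUnreachableTimeout() time.Duration {
	if s.UnreachableTimeout > 0 {
		return s.UnreachableTimeout
	}
	return defaultUnreachableTimeout
}
```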

@luisj1983

@hugovalente-pm OK but doesn't the agent have to tell the cloud "Hey I'm going sleepy-time now, don't go nuts and generate alerts?" If so then the agent can tell the cloud how long it's going down for (the configurable value), right?
Not saying it has to be in the first iteration ofc :-)

@sashwathn

@luisj1983: We are working on a feature to make reachability notifications configurable (at the Space level). We are also adding the ability to identify agent upgrades, intentional restarts, etc., so that we can treat them differently from standard reachability notifications.

cc: @car12o @stelfrag
