[Bug]: Unreachable alerts during upgrades #858
Comments
Please see if you can make it a configurable delay, as some platforms can take a lot longer than others to get things done.
Unfortunately, the startup time of an Agent is correlated with the number of nodes and the data retention. I know that the Agent team has been working hard on reducing this, and I think a delay is a reasonable short-term solution. A possible future approach is for the Agent to actively let Cloud know that it is going down for a restart or an explicit shutdown. This would allow Cloud to distinguish these cases from unexpected disconnects and apply different notification behavior. Also, ephemeral nodes should probably not yield notifications in most cases.
Yes, and it is also correlated with the system resources (e.g. slow storage can significantly delay parent instances with a lot of historical data).
OK, we probably need to see whether we should tackle this in two steps:
To try to assess the urgency of this: I don't think I've seen it reported very often. In Discord it was @luisj1983 and one other user.
@hugovalente-pm People probably aren't bothering to report it, IMO.

Issues

I think that there are two distinct but related issues here.
I think that the agent-reachability issue is a sub-issue of the noisy-alarms issue, because both are predicated on the way Netdata Cloud or the agent handles alerts related to agent actions (I include packaging in that, fair or not).

Priority

I'd say this is important but non-urgent. Netdata works well to make infra and apps more visible with less noise; having activities of the agent itself contribute to the noise contravenes that axiom. We are generating alerts which are completely foreseeable and avoidable in the scenario I'm talking about, because they are caused by the management of the agent itself (whose packaging Netdata controls). Of course, that's why in Ops we have things like maintenance windows and alarm profiles, but none of that is very dynamic. I also see this as a good way to differentiate Netdata from other solutions, and it would encourage customers to keep up to date.

Workarounds

I was looking briefly at dpkg hooks to see if I could make some changes there on my test system, which could then become a documented workaround.
Sorry for the long reply :-)
@hugovalente-pm I still think that a good step 0, until steps 1, 2, etc. are discussed/implemented, is increasing the timeout.
@luisj1983 thanks for the detailed comment. I agree this is an important fix but not urgent (nothing is really breaking), although we certainly shouldn't spam users with alerts that are triggered by agent updates. The best solution really seems to be 2., ensuring that Cloud and the Agent agree on when an agent is supposed to go down. If nobody opposes, we can increase it to 90 seconds, as you had suggested, @ilyam8
@hugovalente-pm I'm OK with the change, although we need to bear in mind it will delay all kinds of reachability notifications, even the ones some users may want to get paged about ASAP.
We understand that increasing the timeout to 90 seconds will increase the delay of all reachability notifications.
I'm fine with a delay, since I'd rather get alerts that are meaningful. One thing to note: I'd strongly recommend that this is not a hard-coded default but a timeout configurable in netdata.conf.
This is something that needs to be controlled from Cloud, since it is Cloud that identifies the unreachable status. It would therefore need to be a setting per Space, which is more effort than changing our current setting.
@hugovalente-pm OK, but doesn't the agent have to tell the cloud "Hey, I'm going sleepy-time now, don't go nuts and generate alerts"? If so, then the agent can tell the cloud how long it's going down for (the configurable value), right?
@luisj1983: We are working on a feature to configure reachability notifications (at the space level). We also have a feature to identify agent upgrades, intentional restarts, etc., so that we can treat them differently from standard reachability notifications.
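To make the idea discussed above concrete, here is a minimal sketch of the "agent announces its own restart" approach: before an update or intentional restart, the agent sends Cloud a message with the reason and an expected downtime, and Cloud suppresses reachability notifications while the disconnect stays inside that window. All message fields and function names here are illustrative assumptions, not the actual Netdata protocol.

```python
import json

def build_shutdown_notice(node_id, reason, expected_downtime_s):
    # Hypothetical message the agent could send before going down.
    # "reason" might be "update", "restart", or "shutdown".
    return json.dumps({
        "node_id": node_id,
        "event": "planned_shutdown",
        "reason": reason,
        "expected_downtime_s": expected_downtime_s,
    })

def should_notify(notice, disconnect_duration_s):
    # Cloud side: only fire a reachability notification once the
    # disconnect has outlasted the announced downtime window.
    data = json.loads(notice)
    return disconnect_duration_s > data["expected_downtime_s"]

notice = build_shutdown_notice("node-1", "update", 90)
print(should_notify(notice, 45))   # False: within the announced window
print(should_notify(notice, 120))  # True: overran the window, notify as usual
```

The design point is that suppression is scoped to the announced window, so an update that hangs still pages someone once the window is exceeded.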
@sashwathn the configurable reachability delay is released; can we close this issue?
@luisj1983: We have now introduced configurable timeouts for reachability notifications, per space and per room. Hope this helps.
Bug description
A user has reported on Discord (thread) that:
Seems to be related to auto-updates
Expected behavior
Unreachable alerts should be fired only after a delay that accounts for the typical time an agent needs to update and restart.
The current time seems to be set to 30 seconds (TBC).
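The expected behavior above can be sketched as a grace period on the Cloud side: an unreachable alert fires only if the agent stays disconnected longer than a configurable delay (e.g. the proposed 90 seconds instead of the current ~30), so routine restarts during auto-updates never page anyone. All class and variable names here are illustrative, not actual Netdata Cloud code.

```python
UNREACHABLE_GRACE_SECONDS = 90  # hypothetical configurable value, per space/room

class ReachabilityMonitor:
    def __init__(self, grace=UNREACHABLE_GRACE_SECONDS):
        self.grace = grace
        self.disconnected_at = {}  # node_id -> disconnect timestamp
        self.alerts = []           # node_ids we actually alerted on

    def on_disconnect(self, node_id, now):
        self.disconnected_at[node_id] = now

    def on_reconnect(self, node_id):
        # Agent came back before the grace period expired: no alert.
        self.disconnected_at.pop(node_id, None)

    def tick(self, now):
        # Fire alerts only for nodes disconnected past the grace period.
        for node_id, since in list(self.disconnected_at.items()):
            if now - since >= self.grace:
                self.alerts.append(node_id)
                del self.disconnected_at[node_id]

m = ReachabilityMonitor()
m.on_disconnect("node-1", now=0)
m.on_reconnect("node-1")       # quick restart, e.g. an auto-update
m.on_disconnect("node-2", now=0)
m.tick(now=90)                 # only node-2 is alerted on
```

A quick restart is absorbed silently, while a genuinely unreachable node still produces a notification once the grace period elapses.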
Steps to reproduce
Screenshots
No response
Error Logs
No response
Additional context
No response