try to avoid alert flaps.. #486

Merged
merged 8 commits into from
Nov 28, 2023
Contributor

@lionel-kaufmann-claranet lionel-kaufmann-claranet commented Sep 13, 2023

The Splunk OTel agent uses supervisor's RPC interface to collect process status.
That is why, for the supervisor.state metric, https://docs.splunk.com/Observability/gdi/monitors-hosts/supervisor.html links the gauge values directly to supervisor's process state codes: http://supervisord.org/subprocess.html#process-states
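
For reference, a minimal sketch of the process state codes listed on the supervisord page linked above, which the gauge values follow. The dictionary and the helper name `state_name` are illustrative only and are not part of the agent or of this repository.

```python
# Process state codes as documented by supervisord (subprocess.html#process-states).
SUPERVISOR_STATES = {
    0: "STOPPED",
    10: "STARTING",
    20: "RUNNING",
    30: "BACKOFF",
    40: "STOPPING",
    100: "EXITED",
    200: "FATAL",
    1000: "UNKNOWN",
}


def state_name(code: int) -> str:
    """Hypothetical helper: map a supervisor.state gauge value to its state name."""
    return SUPERVISOR_STATES.get(code, "UNRECOGNIZED")


if __name__ == "__main__":
    for code in (0, 10, 20):
        print(code, state_name(code))
```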

Now consider one particular supervisor code, 10: STARTING.

Stage 1: the PRODUCTION Way
"What? Oh! This is legacy code. Well, it was developed by a guy who left the company 12 years ago... wait... who was the guy? I don't remember... But never mind the guy!
Do not worry! We have solved the stability issues by restarting all these processes every hour..."
next...

Stage 2: the DEVEL Way
"What? Oh! This is legacy code, with end-of-life libraries, with end-of-life design patterns, with NPEs, with home-made libraries, with memory leaks and sometimes a blocking lock condition... did I forget something?
Do not worry! We have solved the stability issues by restarting all these processes every hour..."
next...

Stage 3: the BUDGET Way (finally)
"What? Oh! This is legacy code, there is incredible tech debt there! The price of refactoring and rebuilding is very high, with no added value... Therefore its line in the budget approval list sits at the very bottom.
Do not worry! We have solved the stability issues by restarting all these processes every hour..."

Frequency: ~24 times per day.
It's a bit noisy to raise a major alert for that...
^_^

By the way... all this noise for an RTFM:

  • the stopped code is 0
  • the starting code is 10... this is definitely less than 20!

Should we consider removing var.process_state_threshold_major?
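
To illustrate the flapping described above, here is a small sketch. It assumes a rule that fires whenever the state gauge is below 20 (RUNNING), which is inferred from the "less than 20" remark. The sample values, the function names and the duration-based variant are hypothetical, and this is not the change merged in this pull request.

```python
RUNNING = 20  # supervisord code for RUNNING; assumed here to be the comparison threshold


def below_running(samples):
    """For each sample, tell whether a naive 'state < 20' rule would fire."""
    return [s < RUNNING for s in samples]


def fires_with_duration(samples, min_consecutive):
    """Fire only if the state stays below RUNNING for min_consecutive samples in a row."""
    streak = 0
    for s in samples:
        streak = streak + 1 if s < RUNNING else 0
        if streak >= min_consecutive:
            return True
    return False


# Hypothetical samples around one scheduled restart:
# RUNNING -> STOPPING -> STOPPED -> STARTING -> RUNNING
restart = [20, 40, 0, 10, 20]

if __name__ == "__main__":
    print(below_running(restart))
    print(fires_with_duration(restart, 3))
```

With these samples the naive rule fires on both the STOPPED (0) and STARTING (10) points of every restart, while the duration-based variant stays quiet unless the process remains down for at least three consecutive samples.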

@lionel-kaufmann-claranet lionel-kaufmann-claranet marked this pull request as draft September 14, 2023 03:53
@lionel-kaufmann-claranet lionel-kaufmann-claranet added the bug (Something isn't working) and detectors (About new or existing detectors) labels Sep 14, 2023
@lionel-kaufmann-claranet lionel-kaufmann-claranet changed the title from "try to avoid alert flaps when superisor auto restart process" to "try to avoid alert flaps.." Sep 14, 2023
Member

@pdecat pdecat left a comment

LGTM!

@lionel-kaufmann-claranet lionel-kaufmann-claranet marked this pull request as ready for review November 20, 2023 12:28
@haedri haedri merged commit d5075e6 into master Nov 28, 2023
28 checks passed