Using Elastalert to notify us when "Service are back online" #1919

Open
Purfakt opened this issue Sep 20, 2018 · 4 comments
Comments

Purfakt commented Sep 20, 2018

I created a flatline rule that sends an email when messages stop arriving from a triplet that represents one of our services (this one works):

name: Service Down
type: flatline

index: 'ourindex'

timestamp_field: 'ourtimestamp'
timestamp_type: unix_ms

query_key: ["s_serviceInfo-instanceId", "s_serviceInfo-replicaId", "s_serviceInfo-serviceName"]

realert:
  hours: 1

filter:
  - query_string:
      query: "s_qs_item-name: machine.cpu"

timeframe:
  minutes: 5

threshold: 1

alert:
  - email
from_addr: "alert@domain.com"
email: "me@domain.com"

alert_subject: "Service down on {0}"
# "key" represents the query key
alert_subject_args: 
  - "key"

alert_text: |
    Service not answering on {0} at {1}
alert_text_args: 
  - "key"
  - "ourtimestamp"

alert_text_type: alert_text_only

Now I need to create an alarm to notify us when this service is back online.
I thought I'd be smart and create a new_term rule with a terms_window_size that matches the timeframe of the flatline alarm:

name: Service up

type: new_term

index: 'ourindex'

timestamp_field: 'ourtimestamp'
timestamp_type: unix_ms

fields: 
  - "s_serviceInfo-instanceId" 
  - "s_serviceInfo-replicaId" 
  - "s_serviceInfo-serviceName"

query_key: ["s_serviceInfo-instanceId", "s_serviceInfo-replicaId", "s_serviceInfo-serviceName"]

realert:
  hours: 1

filter:
  - query_string:
      query: "s_qs_item-name: machine.cpu"

terms_window_size:
  minutes: 5

window_step_size:
  minutes: 1

alert:
  - email
from_addr: "alert@domain.com"
email: "me@domain.com"

alert_subject: "Service up on {0}"
alert_subject_args: 
  - "s_serviceInfo-instanceId"

alert_text: |
    Service is now up and running on {0}, {1} at {2}
alert_text_args: 
  - "s_serviceInfo-instanceId"
  - "s_serviceInfo-replicaId"
  - "ourtimestamp"

alert_text_type: alert_text_only

Obviously, I'm either misunderstanding something or at least misusing it, because the first alarm works great on all services but the second one only triggers when it is first added to the rules folder. After that there are 0 matches, and the rule isn't silenced.

What am I doing wrong? Is there a less convoluted way to achieve this?

Qmando (Member) commented Sep 20, 2018

new_term alerts only the first time a new value appears, so I don't think it's right for this purpose. Unfortunately there's not a nice mechanism to do this, but there is a slightly less convoluted way: you can create a second flatline rule that watches the alerts recorded for the first flatline rule itself.

Roughly, something like this:

type: flatline
index: elastalert_status # (may be different for you)
filter:
  - term:
      rule_name: "Service Down"
  - term:
      _type: elastalert # (not needed in ES 6)
forget_keys: true
timeframe:
  minutes: 70
threshold: 1

I.e., "Alert if 'Service Down' hasn't alerted in at least 70 minutes". forget_keys will cause it to alert only once after the 'Service Down' alerts stop, and then stay quiet until 'Service Down' alerts again.
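
To make the intended behaviour concrete, here is a minimal Python sketch of the logic described above. It is a simplified model, not ElastAlert's implementation: the function name `recovery_alert_times` and the 10-minute check interval are assumptions made for illustration. The 70-minute window is presumably chosen to exceed the one-hour realert of the "Service Down" rule, so that a service that stays down keeps resetting the timer.

```python
from datetime import datetime, timedelta


def recovery_alert_times(down_alerts, checks, timeframe=timedelta(minutes=70)):
    """Return the check times at which a "back online" alert would fire.

    Hypothetical model: fire when no 'Service Down' alert has been recorded
    for `timeframe`, then stay quiet until a new 'Service Down' alert is seen.
    """
    pending = sorted(down_alerts)
    fired = []
    last_down = None
    armed = False  # True while a 'Service Down' alert is still unanswered
    for now in sorted(checks):
        # Consume every 'Service Down' alert recorded up to this check.
        while pending and pending[0] <= now:
            last_down = pending.pop(0)
            armed = True
        if armed and now - last_down >= timeframe:
            fired.append(now)
            armed = False  # "forget" until 'Service Down' alerts again
    return fired


start = datetime(2018, 9, 20, 12, 0)
# 'Service Down' alerts at +0 and +60 minutes (hourly realert), then again at +200.
down = [start + timedelta(minutes=m) for m in (0, 60, 200)]
checks = [start + timedelta(minutes=m) for m in range(0, 301, 10)]
fired = recovery_alert_times(down, checks)
offsets = [int((t - start).total_seconds() // 60) for t in fired]
print(offsets)
```

With these inputs the sketch fires once per outage, at +130 and +270 minutes, which is the "alert once, then wait until it happens again" behaviour.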

Purfakt (Author) commented Sep 21, 2018

Thank you for the quick answer! I like the emphasis on "slightly less convoluted". I see how this would work, but there is one big problem: see how I am using a triplet for the fields?

fields: 
  - "s_serviceInfo-instanceId" 
  - "s_serviceInfo-replicaId" 
  - "s_serviceInfo-serviceName"

While looking at the elastalert_status index, I found no way of telling which service is down or back up.
So what happens when several services go down and only some of them come back up? Or if service A goes down and then comes back up, but B goes down in the meantime, the "back online" alert won't be triggered, because "Service Down" alerts are still being sent.
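
The masking problem can be illustrated with a small Python sketch. Everything here is hypothetical: the alert times, the service names "A" and "B", and the helper `first_quiet_time` are made up for the example, and the per-service view assumes the recorded alerts could be grouped by service, which is exactly what I could not find a way to do.

```python
from collections import defaultdict

TIMEFRAME = 70  # minutes without a 'Service Down' alert before "back online" fires


def first_quiet_time(alert_minutes, checks, timeframe=TIMEFRAME):
    """Return the first check time with no alert for `timeframe` minutes, or None."""
    for now in checks:
        seen = [t for t in alert_minutes if t <= now]
        if seen and now - max(seen) >= timeframe:
            return now
    return None


# (minute, service) pairs: A alerts once and recovers, B stays down and
# re-alerts every hour.
down_alerts = [(0, "A"), (30, "B"), (90, "B"), (150, "B"), (210, "B")]
checks = range(0, 241, 10)

# View 1: only the rule name is visible, so all alerts form a single stream.
global_view = first_quiet_time([t for t, _ in down_alerts], checks)

# View 2 (hypothetical): alerts can be grouped per service.
per_service = defaultdict(list)
for minute, service in down_alerts:
    per_service[service].append(minute)
keyed_view = {s: first_quiet_time(ts, checks) for s, ts in per_service.items()}

print(global_view)  # None: B's alerts keep resetting the timer
print(keyed_view)   # {'A': 70, 'B': None}
```

In the single-stream view the recovery of A is never reported, because B's hourly alerts never leave a 70-minute gap. Only the grouped view notices that A went quiet at minute 70.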

Purfakt (Author) commented Sep 26, 2018

Hi,
Is there any update on this issue?
Thanks in advance!

@damioune123

up
