Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Spike alert not using specified timeframe for querying event counts #3041

Open
piehld opened this issue Nov 16, 2020 · 5 comments
Open

Spike alert not using specified timeframe for querying event counts #3041

piehld opened this issue Nov 16, 2020 · 5 comments

Comments

@piehld
Copy link

piehld commented Nov 16, 2020

When running a spike rule with the following configuration (and verbose stdout), the time window over which elastalert is counting events appears to be that of the run_every time interval as opposed to the timeframe interval.

Global configuration: config.yaml

rules_folder: /home/ubuntu/elastalert/rules
run_every:
  minutes: 5
buffer_time:
  hours: 1
es_host: localhost
es_port: 9200
use_ssl: False
verify_certs: True
writeback_index: elastalert
writeback_alias: elastalert_alerts
alert_time_limit:
  days: 2

Spike rule configuration: web-traffic-spike.yaml

name: "Top-level LB stability - Dramatic change in web traffic"
es_host: my-elasticsearch-host.domain.org
es_port: 9200
index: my-index-*
type: spike
buffer_time:
  hours: 3
timeframe:
  hours: 3
spike_height: 2.5
spike_type: both
threshold_ref: 75000
# threshold_cur:
query_key: service.name
alert_on_new_data: False
use_count_query: True
doc_type: _doc
realert:
  hours: 3
alert:
- "email"
email:
- "my_email@domain.org"
from_addr: "it-services@domain.org"

After allowing this to run for 3 days (thus clearly surpassing the initial timeframe duration), my stdout is still reporting as follows:

INFO:elastalert:Background configuration change check run at 2020-11-16 07:21 PST
INFO:elastalert:Background alerts thread 0 pending alerts sent at 2020-11-16 07:21 PST
INFO:elastalert:Queried rule Top-level LB stability - Dramatic change in web traffic from 2020-11-16 07:16 PST to 2020-11-16 07:21 PST: 5661 hits
INFO:elastalert:Ran Top-level LB stability - Dramatic change in web traffic from 2020-11-16 07:16 PST to 2020-11-16 07:21 PST: 5661 query hits (0 already seen), 0 matches, 0 alerts sent
INFO:elastalert:Disabled rules are: []
INFO:elastalert:Sleeping for 299.999776 seconds

As you can see, the queries are only being performed over 5 min intervals (as defined for run_every), not 3 hours as specified for timeframe. Any idea on why this is happening?

@nsano-rururu
Copy link
Contributor

We also recommend asking questions on the Gitter channel
https://gitter.im/Yelp/elastalert

@piehld
Copy link
Author

piehld commented Nov 17, 2020

@nsano-rururu Thank you, I posted it there as well. However, after investigating this myself, I believe this actually is indicative of a bug in the verbose output.

It seems that the verbose output is reporting the count from the run_every time window, while the alert itself is also checking the count from the timeframe window but simply not printing that to the standard output. I noticed this because I ended up receiving an alert that explicitly stated the full time window, for which the count was accurate:

An abnormal number (921000) of events occurred around 2020-11-16 15:53 PST.
Preceding that time, there were only 366000 events within 3:00:00

So I guess there are two bugs:
(1) The program shouldn't be wasting time querying the run_every time window anyway, and
(2) The count from the timeframe window query should be what is printed to the verbose standard output

@nsano-rururu
Copy link
Contributor

You have to choose between continuing to use it as it is or fixing the bug yourself.

@piehld
Copy link
Author

piehld commented Nov 18, 2020

OK I understand. It's not a major bug as it does not appear to affect the actual alert, just the verbose output. So it's definitely still usable as is. I will try to look into fixing it as I continue setting up new alerts for our systems.

Thank you for your response and all your support of this project.

@findmyname666
Copy link

findmyname666 commented Feb 25, 2021

Hello all,

I did a few tests and it really seems that rules are executed based on run_every configuration and rule timeframe isn't respected.

The following example is for terms query.

  • We can see from and to dates in logs for each query - link.
  • The same start and end time is used a few lines above for reconstruction of elastic query - link.

It means that these dates should be used for query itself.
I run elastalert with argument --es_debug_trace to see executed queries.
There are really used these dates, probably calculated from run_every settings.
FYI - run_every is set to 1 minute in this case.

  "query": {
    "bool": {
      "filter": {
        "bool": {
          "must": [
            {
              "range": {
                "@timestamp": {
                  "gt": "2021-02-25T16:35:30.857161Z",
                  "lte": "2021-02-25T16:35:34.273806Z"
                }
              }
            },

I see that there exists function called set_starttime and it seems that it takes into validation timeframe or buffer_time but only when rule don't have option use_count_query or use_terms_query - link. If the rule have one of these options the start time is set:

  1. Based on run_every configuration option when scan_entire_timeframe isn't set in rule.
  2. Based on timeframe interval if scan_entire_timeframe is set in rule.

Therefore the solution is to set scan_entire_timeframe in rule.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants