
[Monitoring] Missing monitoring data alert #78208

Merged

Conversation

@chrisronline chrisronline commented Sep 22, 2020

Resolves #74823

This PR introduces a new out-of-the-box alert for Stack Monitoring that identifies missing periods of monitoring data.

Firing message

We have not detected any monitoring data for 2 stack product(s) in cluster: abc123

Firing UI message

For the past 2m, we have not detected any monitoring data from the Kibana instance: kib-01, starting at September 23, 2020 3:07 PM EDT

Screenshots

Screen Shot 2020-09-23 at 2 50 41 PM

Screen Shot 2020-09-23 at 2 50 52 PM

Screen Shot 2020-09-23 at 2 50 59 PM

Screen Shot 2020-09-23 at 2 54 37 PM

@chrisronline chrisronline marked this pull request as ready for review September 24, 2020 19:22
@chrisronline chrisronline requested a review from a team September 24, 2020 19:22
@chrisronline chrisronline self-assigned this Sep 24, 2020
@elasticmachine

Pinging @elastic/stack-monitoring (Team:Monitoring)

}
}

uniqueList[`${clusterUuid}::${stackProduct}::${stackProductUuid}`] = {
Contributor

You should only overwrite: if (differenceInMs > uniqueList[key]?.gapDuration), otherwise you might overwrite a big gap (that could've potentially triggered the alert) with a smaller one (that would not)

size: number
): Promise<AlertMissingData[]> {
const endMs = +new Date();
const startMs = endMs - limit - limit * 0.25; // Go a bit farther back because we need to detect the difference between seeing the monitoring data versus just not looking far enough back
Contributor

I don't think this is a good idea, since 25% is a pretty big padding for the one day default. How about we just have a minimum limit of 3 minutes? That way we can do something like: endMs - (limit + 180000)

Contributor Author

Yea, that's probably fair. I guess I just wanted to be sure I accounted for various changes to the default collection period but 3m is probably a good enough distance to go back. Thanks!
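The fixed-padding approach agreed on above could look roughly like this (a sketch; computeSearchWindow is a hypothetical name):

```typescript
// Pad the search window by a fixed 3 minutes rather than 25% of the limit,
// so a one-day default limit is not stretched by six extra hours.
const MIN_PADDING_MS = 3 * 60 * 1000; // 180000

function computeSearchWindow(limitMs: number, nowMs: number = Date.now()) {
  const endMs = nowMs;
  // Go slightly farther back than the limit, to distinguish "data stopped"
  // from "we simply didn't look far enough back".
  const startMs = endMs - (limitMs + MIN_PADDING_MS);
  return { startMs, endMs };
}
```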

const differenceInMs = +new Date() - uuidBucket.most_recent.value;
let stackProductName = stackProductUuid;
for (const nameField of nameFields) {
stackProductName = get(uuidBucket, `top.hits.hits[0]._source.${nameField}`);
Contributor

Suggested change
stackProductName = get(uuidBucket, `top.hits.hits[0]._source.${nameField}`);
stackProductName = get(uuidBucket, `document.hits.hits[0]._source.${nameField}`);

There was no top.* field name in my results, so I assume you wanted the above?

Contributor Author

Yes, thank you!
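With the corrected document path, resolving a display name with a UUID fallback might look like this sketch (resolveStackProductName is hypothetical; the real code uses lodash's get over the aggregation bucket):

```typescript
// Walk the candidate name fields in order and return the first one present
// in the top_hits source, falling back to the UUID when none resolves.
function resolveStackProductName(
  source: Record<string, unknown>,
  nameFields: string[],
  fallbackUuid: string
): string {
  for (const field of nameFields) {
    // Minimal dotted-path lookup standing in for lodash's get.
    const value = field
      .split('.')
      .reduce<unknown>(
        (obj, key) => (obj as Record<string, unknown> | undefined)?.[key],
        source
      );
    if (typeof value === 'string' && value.length > 0) {
      return value;
    }
  }
  return fallbackUuid;
}
```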

return {
instanceKey: `${missing.clusterUuid}:${missing.stackProduct}:${missing.stackProductUuid}`,
clusterUuid: missing.clusterUuid,
shouldFire: missing.gapDuration > duration && missing.gapDuration <= limit,
Contributor

I don't think you need the ... && missing.gapDuration <= limit check, since your search query is already within the limit range (give or take). But even if it weren't, wouldn't it still qualify as a valid trigger, since duration would always be less than limit?
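The simplification this comment proposes, as a sketch (shouldFire here is a standalone predicate, not the PR's actual code):

```typescript
// The upper-bound check (gapDuration <= limit) is arguably redundant: the
// query is already constrained to the limit window, and duration is always
// less than limit. The predicate then reduces to a single comparison.
function shouldFire(gapDurationMs: number, durationMs: number): boolean {
  return gapDurationMs > durationMs;
}
```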

];

protected async fetchData(
params: CommonAlertParams,
Contributor

You can also set the type as params: MissingDataParams here, and params: CommonAlertParams | unknown in base_alerts.ts. That way you won't need to do any funky casting/recasting.
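A sketch of the typing suggestion (the interface fields and class names below are hypothetical; only the base/subclass parameter shapes matter):

```typescript
// The base class accepts a loose union while the subclass narrows to its
// own params type, avoiding casts in the subclass body.
interface CommonAlertParams {
  duration: string;
}

interface MissingDataParams extends CommonAlertParams {
  limit: string;
}

class BaseAlert {
  protected async fetchData(_params: CommonAlertParams | unknown): Promise<unknown[]> {
    return [];
  }
}

class MissingDataAlert extends BaseAlert {
  public async run(params: MissingDataParams): Promise<unknown[]> {
    return this.fetchData(params);
  }

  protected async fetchData(params: MissingDataParams): Promise<unknown[]> {
    // params.limit is usable here without any casting
    return [params.limit];
  }
}
```

This relies on TypeScript's bivariant checking of method parameters, which permits the narrowed override.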

@igoristic igoristic left a comment

@chrisronline

This is a good effort, and I like the UI/UX feel of it; however, I have some concerns/opinions:

  1. I think these should be separate alerts based on individual product, so we can set different thresholds and have the ability to enable/disable them for each specific product (es, beats, kibana, etc)

  2. Probably an oversight on our part, but the term “Missing Data” is kinda confusing. It makes it seem as though there’s missing data in the production cluster. I think we should rename it to something like the "Intermittent Monitoring" or "Monitoring Collectors" alert and avoid the word “data”

  3. I can’t get it to trigger most of the time even though my gaps are bigger than the threshold. This is how I tested it:

  • First I made my threshold values more sensitive: gaps 1min, range 1 day
  • Then I did: "xpack.monitoring.collection.enabled": false via cluster settings (for about ~10min)
  • Waited for about 5min for the notification to show up on the Overview page
  • Then went to the node’s detail page and confirmed that the gaps are indeed there. I did get it to trigger one time (the next day), but I don't know what I did differently
  4. I feel like the code for these alerts (including the cpu and disk usage alerts) is pretty bulky, even though a lot of the functionality/features are very similar (if not exactly the same). I'm only bringing this up because I see us adding a lot of these alerts in the future (and I don't know how well it will scale if each alert/pr is 1-2k lines of code and has some custom logic/flow). I sort of tried to address this in my "Disk Usage" pr: [Monitoring] Disk usage alerting #75419, but I'm starting to feel like maybe the one-size-fits-all approach is too good to be true. This might be just the nature of things, and I'm probably going off on a rant/tangent here. Would like to hear your opinion though

@igoristic

Just figured out why I wasn't getting any triggers
Screen Shot 2020-09-27 at 1 41 21 PM

Notice 1.601228437628E12, which is odd, since it should look something like 1601228946214. It looks like the timestamp is getting cast as a float (probably because of a precision/rounding error somewhere, maybe an ES bug)

This is the query I used
GET .monitoring-*-7-*/_search?filter_path=aggregations.index.buckets,took
{
"size": 0,
"query": {
  "bool": {
    "filter": [
      {
        "terms": {
          "cluster_uuid": [
            "wuXG3QJKThmajyOWMx20hw"
          ]
        }
      },
      {
        "range": {
          "timestamp": {
            "gte": "now-1d"
          }
        }
      }
    ]
  }
},
"aggs": {
  "index": {
    "terms": {
      "field": "_index",
      "size": 10000
    },
    "aggs": {
      "clusters": {
        "terms": {
          "field": "cluster_uuid",
          "size": 10000
        },
        "aggs": {
          "es_uuids": {
            "terms": {
              "field": "node_stats.node_id",
              "size": 10000
            },
            "aggs": {
              "most_recent": {
                "max": {
                  "field": "timestamp"
                }
              },
              "document": {
                "top_hits": {
                  "size": 1,
                  "sort": [
                    {
                      "timestamp": {
                        "order": "desc"
                      }
                    }
                  ],
                  "_source": {
                    "includes": [
                      "source_node.name",
                      "kibana_stats.kibana.name",
                      "logstash_stats.logstash.host",
                      "beats_stats.beat.name"
                    ]
                  }
                }
              }
            }
          },
          "kibana_uuids": {
            "terms": {
              "field": "kibana_stats.kibana.uuid",
              "size": 10000
            },
            "aggs": {
              "most_recent": {
                "max": {
                  "field": "timestamp"
                }
              },
              "document": {
                "top_hits": {
                  "size": 1,
                  "sort": [
                    {
                      "timestamp": {
                        "order": "desc"
                      }
                    }
                  ],
                  "_source": {
                    "includes": [
                      "source_node.name",
                      "kibana_stats.kibana.name",
                      "logstash_stats.logstash.host",
                      "beats_stats.beat.name"
                    ]
                  }
                }
              }
            }
          },
          "beats": {
            "filter": {
              "bool": {
                "must_not": {
                  "term": {
                    "beats_stats.beat.type": "apm-server"
                  }
                }
              }
            },
            "aggs": {
              "beats_uuids": {
                "terms": {
                  "field": "beats_stats.beat.uuid",
                  "size": 10000
                },
                "aggs": {
                  "most_recent": {
                    "max": {
                      "field": "timestamp"
                    }
                  },
                  "document": {
                    "top_hits": {
                      "size": 1,
                      "sort": [
                        {
                          "timestamp": {
                            "order": "desc"
                          }
                        }
                      ],
                      "_source": {
                        "includes": [
                          "source_node.name",
                          "kibana_stats.kibana.name",
                          "logstash_stats.logstash.host",
                          "beats_stats.beat.name"
                        ]
                      }
                    }
                  }
                }
              }
            }
          },
          "apms": {
            "filter": {
              "bool": {
                "must": {
                  "term": {
                    "beats_stats.beat.type": "apm-server"
                  }
                }
              }
            },
            "aggs": {
              "apm_uuids": {
                "terms": {
                  "field": "beats_stats.beat.uuid",
                  "size": 10000
                },
                "aggs": {
                  "most_recent": {
                    "max": {
                      "field": "timestamp"
                    }
                  },
                  "document": {
                    "top_hits": {
                      "size": 1,
                      "sort": [
                        {
                          "timestamp": {
                            "order": "desc"
                          }
                        }
                      ],
                      "_source": {
                        "includes": [
                          "source_node.name",
                          "kibana_stats.kibana.name",
                          "logstash_stats.logstash.host",
                          "beats_stats.beat.name"
                        ]
                      }
                    }
                  }
                }
              }
            }
          },
          "logstash_uuids": {
            "terms": {
              "field": "logstash_stats.logstash.uuid",
              "size": 10000
            },
            "aggs": {
              "most_recent": {
                "max": {
                  "field": "timestamp"
                }
              },
              "document": {
                "top_hits": {
                  "size": 1,
                  "sort": [
                    {
                      "timestamp": {
                        "order": "desc"
                      }
                    }
                  ],
                  "_source": {
                    "includes": [
                      "source_node.name",
                      "kibana_stats.kibana.name",
                      "logstash_stats.logstash.host",
                      "beats_stats.beat.name"
                    ]
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}
}

I tried it with epoch time range as well, and still got the same results
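As an aside, a scientific-notation value like the one above still parses to the expected epoch millis in JavaScript, so a defensive normalization on the consumer side could be as small as this (a sketch, not code from this PR):

```typescript
// An Elasticsearch max aggregation returns a double, so an epoch-millis
// timestamp can come back formatted like 1.601228437628E12. Number() parses
// that notation, and Math.round clears any floating-point residue.
function normalizeEpochMillis(raw: string | number): number {
  return Math.round(Number(raw));
}
```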

@chrisronline

I think these should be separate alerts based on individual product, so we can set different thresholds and have the ability to enable/disable them for each specific product (es, beats, kibana, etc)

@ravikesarwani Do you have any thoughts about this? Should we have a single alert for any missing monitoring data? or a separate alert for each product?

Probably an oversight on our part, but the term “Missing Data” is kinda confusing. It makes it seem as though there’s missing data in the production cluster. I think we should rename it to something like the "Intermittent Monitoring" or "Monitoring Collectors" alert and avoid the word “data”

💯 I absolutely agree and it never crossed my mind once so thank you for pointing it out! I will update the label in the UI.

I can’t get it to trigger most of the time even though my gaps are bigger than the threshold. This is how I tested it:

Then I did: "xpack.monitoring.collection.enabled": false via cluster settings (for about ~10min)

I got this to work for me, so I'm not sure. Maybe do a screen recording?

Would like to hear your opinion though

I definitely agree and don't like the copy/paste for a new alert. My goal was to build a few of these alerts out to truly see what abstraction made sense - I think after 7.10, we can take a look at what we have and make a pass at some abstraction layer that will make it easier to build new alerts.

@igoristic

@chrisronline

Now I understand that we don't look for specific gaps in a range, but rather for when the data stops for a given time span. Given that, the limit setting ("Look this far back in time for any data") doesn't make sense to me. I think we can remove it (from the UI) and use something like: limit = duration * 1.25. But maybe I'm missing something?
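The alternative proposed above, as a one-line sketch (deriveLimitMs is a hypothetical name; this was discussed but not adopted in the PR):

```typescript
// Derive the look-back window from the duration threshold instead of
// exposing it as its own user-facing setting.
function deriveLimitMs(durationMs: number): number {
  return durationMs * 1.25;
}
```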

@chrisronline

@igoristic Both levers feel important to me. I feel a user should be able to configure how long the alert should be left to fire until we basically give up. Perhaps we need to rename the parameter, but I don't feel we can define a threshold that applies to all users. I recall @ravikesarwani feeling both levers had value too.

@igoristic

@chrisronline I see your point, but the wording is actually what made me assume we're looking for gaps

I still feel like it's not really needed, and we're also giving the user more options that could potentially break the trigger (e.g. limit < duration, or limit > 7d, which can time out a query)

@igoristic

@chrisronline Thanks for adding the changes! Looking a lot better 👍

Though, I still can't get it to trigger with a simple "xpack.monitoring.collection.enabled": false (via cluster settings). My threshold is at 5 min and I waited 10 min. I confirmed my nodes details page and the data did indeed stop

Also, my PR is now using getSafeForExternalLink to add current state (which can be either single or ccs) to the link:
8693ed7#diff-64a93554c0926988a7c616494814ac6bR62 This will break some of the links after my PR is merged

@chrisronline

@igoristic Awesome find, I found the reason for your testing issues. Ready for another round!

@igoristic igoristic left a comment

Yey! Working pretty well now. Awesome job! 🏆

Since you still need to fix the type check, can you please also:

  • Remove all the globalState ?_g from links, since it's added in the UI now
  • And correct any translation ids that don't relate to the context: eg: ...missingData.ui.nextSteps.hotThreads

@chrisronline

Remove all the globalState ?_g from links, since it's added in the UI now

FWIW, this shouldn't apply to the action usage of global state in the URL because that is delivered through the notification provider (slack, email) and that needs to contain the contextual link

@chrisronline chrisronline merged commit a61f4d4 into elastic:master Oct 1, 2020
@chrisronline chrisronline deleted the monitoring/missing_data_alert branch October 1, 2020 16:28
chrisronline added a commit that referenced this pull request Oct 1, 2020
* WIP for alert

* Surface alert most places

* Fix up alert placement

* Fix tests

* Type fix

* Update copy

* Add alert presence to APM in the UI

* Fetch data a little differently

* We don't need moment

* Add tests

* PR feedback

* Update copy

* Fix up bug around grabbing old data

* PR feedback

* PR feedback

* Fix tests
# Conflicts:
#	x-pack/plugins/monitoring/public/components/apm/instance/instance.js
#	x-pack/plugins/monitoring/public/components/beats/beat/beat.js
@chrisronline

Backport:

7.x: 45c215d

phillipb added a commit to phillipb/kibana that referenced this pull request Oct 1, 2020
…aly-detection-partition-field

* 'master' of github.com:elastic/kibana: (76 commits)
  Fix z-index of KQL Suggestions dropdown (elastic#79184)
  [babel] remove unused/unneeded babel plugins (elastic#79173)
  [Search] Fix timeout upgrade link (elastic#79045)
  Always Show Embeddable Panel Header in Edit Mode (elastic#79152)
  [Ingest]: add more test for transform index (elastic#79154)
  [ML] DF Analytics: Collapsable sections on results pages (elastic#76641)
  [Fleet] Fix agent policy change action migration (elastic#79046)
  [Ingest Manager] Match package spec `dataset`->`data_stream` and `config_templates`->`policy_templates` renaming (elastic#78699)
  Revert "[Metrics UI] Add ability to override datafeeds and job config for partition field (elastic#78875)"
  [ML] Update transform cloning to include description and new fields (elastic#78364)
  chore(NA): remove non existing plugin paths from case api integration tests (elastic#79127)
  [Ingest Manager] Ensure we trigger agent policy updated event when we bump revision. (elastic#78836)
  [Metrics UI] Display No Data context.values as [NO DATA] (elastic#78038)
  [Monitoring] Missing data alert (elastic#78208)
  [Lens] Fix embeddable title and description for reporting and dashboard tooltip (elastic#78767)
  [Lens] Consistent Drag and Drop styles (elastic#78674)
  [ML] Model management UI fixes and enhancements (elastic#79072)
  [Metrics UI] Add ability to override datafeeds and job config for partition field (elastic#78875)
  [Security Solution]Fix basepath used by endpoint telemetry tests (elastic#79027)
  update rum agent version which contains longtasks (elastic#79105)
  ...
@chrisronline chrisronline changed the title [Monitoring] Missing data alert [Monitoring] Missing monitoring data alert Dec 15, 2020
Successfully merging this pull request may close these issues.

[Monitoring][Additional-Alerting] Missing Data / Gaps