-
Notifications
You must be signed in to change notification settings - Fork 8.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Monitoring] Missing monitoring data alert #78208
[Monitoring] Missing monitoring data alert #78208
Conversation
Pinging @elastic/stack-monitoring (Team:Monitoring) |
} | ||
} | ||
|
||
uniqueList[`${clusterUuid}::${stackProduct}::${stackProductUuid}`] = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should only overwrite: if (differenceInMs > uniqueList[key]?.gapDuration)
otherwise you might overwrite a big gap (that could've potentially trigger the alert) with a smaller one (that would not)
size: number | ||
): Promise<AlertMissingData[]> { | ||
const endMs = +new Date(); | ||
const startMs = endMs - limit - limit * 0.25; // Go a bit farther back because we need to detect the difference between seeing the monitoring data versus just not looking far enough back |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is a good idea, since 25% is a pretty big padding for the one day default. How about we just have a minimum limit of 3 minutes? That way we can do something like: endMs - (limit + 180000)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, that's probably fair. I guess I just wanted to be sure I accounted for various changes to the default collection period but 3m
is probably a good enough distance to go back. Thanks!
const differenceInMs = +new Date() - uuidBucket.most_recent.value; | ||
let stackProductName = stackProductUuid; | ||
for (const nameField of nameFields) { | ||
stackProductName = get(uuidBucket, `top.hits.hits[0]._source.${nameField}`); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
stackProductName = get(uuidBucket, `top.hits.hits[0]._source.${nameField}`); | |
stackProductName = get(uuidBucket, `document.hits.hits[0]._source.${nameField}`); |
There was no top.*
field name in my results, so I assume you wanted the above?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, thank you!
return { | ||
instanceKey: `${missing.clusterUuid}:${missing.stackProduct}:${missing.stackProductUuid}`, | ||
clusterUuid: missing.clusterUuid, | ||
shouldFire: missing.gapDuration > duration && missing.gapDuration <= limit, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think you need: ... && missing.gapDuration <= limit
check, since your search query is already within the limit range (give or take). But, even if, wouldn't it still qualify as a valid trigger? Since, duration
would always be less than limit
]; | ||
|
||
protected async fetchData( | ||
params: CommonAlertParams, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can also set the type as params: MissingDataParams
here, and params: CommonAlertParams | unknown
in base_alerts.ts
That way you won't need to do any funky casting/recasting
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is good effort, and I like the UI/UX feel of it, however I have some concerns/opinions:
-
I think these should be separate alerts based on individual product, so we can set different thresholds and have the ability to enable/disable them for each specific product (es, beats, kibana, etc)
-
Probably an oversight on our part, but the term: “Missing Data” is kinda confusing. It makes it seem as though there’s missing data in the production cluster. I think we should rename it to something like "Intermittent Monitoring" or Monitoring Collectors" alert and avoid the word “data”
-
I can’t get it to trigger most of the time even though my gaps are bigger than the threshold. This is how I tested it:
- First I made my threshold values more sensitive: gaps 1min, range 1 day
- Then I did:
"xpack.monitoring.collection.enabled": false
via cluster settings (for about ~10min) - Waited for about 5min for the notification to show up on the Overview page
- Then went to the node’s detail page and confirmed that the gaps are indeed there I did get it to trigger one time (the next day, but don’t know what I did differently)
- I feel like the code for these alerts (including cpu and disk usage alerts) are pretty bulky, even though a lot of the functionality/features are very similar (if not exactly the same). I'm only bringing this up, because I see us adding a lot of these alerts in the future (and don't know how well it will scale if each alert/pr is 1-2k lines of code and has some custom logic/flow). I sort of tried to address this in my "Disk Usage" pr: [Monitoring] Disk usage alerting #75419, but starting to feel like maybe the one size fits all approach is too good to be true. This might be just the nature of things, and I'm probably going off on a rant/tangent here. Would like to hear your opinion though
@ravikesarwani Do you have any thoughts about this? Should we have a single alert for any missing monitoring data? or a separate alert for each product?
💯 I absolutely agree and it never crossed my mind once so thank you for pointing it out! I will update the label in the UI.
I got this to work for me, so I'm not sure. Maybe do a screen recording?
I definitely agree and don't like the copy/paste for a new alert. My goal was to build a few of these alerts out to truly see what abstraction made sense - I think after 7.10, we can take a look at what we have and make a pass at some abstraction layer that will make it easier to build new alerts. |
Now that I understand we don't look for specific gaps in a range, but rather when the data stops for a specific time span. The |
@igoristic Both levers feel important to me. I feel a user should be able to configure how long the alert should be left to fire until we basically give up. Perhaps we need to rename the parameter, but I don't feel we can define a threshold that applies to all users. I recall @ravikesarwani feeling both levers had value too. |
@chrisronline I see your point, but the wording is actually what made me assume we're looking for gaps I still feel like it's not really needed, and we're also giving a user more options that could potentially break the trigger (ex: |
@chrisronline Thanks for adding the changes! Looking a lot better 👍 Though, I still can't get it to trigger with a simple Also, my PR is now using |
@igoristic Awesome find, I found the reason for your testing issues. Ready for another round! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeyr! Working pretty good now. Awesome job! 🏆
Since you still need to fix the the Type check. Can you please also:
- Remove all the globalState
?_g
from links, since it's added in the UI now - And correct any translation ids that don't relate to the context: eg:
...missingData.ui.nextSteps.hotThreads
FWIW, this shouldn't apply to the |
* WIP for alert * Surface alert most places * Fix up alert placement * Fix tests * Type fix * Update copy * Add alert presence to APM in the UI * Fetch data a little differently * We don't need moment * Add tests * PR feedback * Update copy * Fix up bug around grabbing old data * PR feedback * PR feedback * Fix tests # Conflicts: # x-pack/plugins/monitoring/public/components/apm/instance/instance.js # x-pack/plugins/monitoring/public/components/beats/beat/beat.js
Backport: 7.x: 45c215d |
…aly-detection-partition-field * 'master' of github.com:elastic/kibana: (76 commits) Fix z-index of KQL Suggestions dropdown (elastic#79184) [babel] remove unused/unneeded babel plugins (elastic#79173) [Search] Fix timeout upgrade link (elastic#79045) Always Show Embeddable Panel Header in Edit Mode (elastic#79152) [Ingest]: add more test for transform index (elastic#79154) [ML] DF Analytics: Collapsable sections on results pages (elastic#76641) [Fleet] Fix agent policy change action migration (elastic#79046) [Ingest Manager] Match package spec `dataset`->`data_stream` and `config_templates`->`policy_templates` renaming (elastic#78699) Revert "[Metrics UI] Add ability to override datafeeds and job config for partition field (elastic#78875)" [ML] Update transform cloning to include description and new fields (elastic#78364) chore(NA): remove non existing plugin paths from case api integration tests (elastic#79127) [Ingest Manager] Ensure we trigger agent policy updated event when we bump revision. (elastic#78836) [Metrics UI] Display No Data context.values as [NO DATA] (elastic#78038) [Monitoring] Missing data alert (elastic#78208) [Lens] Fix embeddable title and description for reporting and dashboard tooltip (elastic#78767) [Lens] Consistent Drag and Drop styles (elastic#78674) [ML] Model management UI fixes and enhancements (elastic#79072) [Metrics UI] Add ability to override datafeeds and job config for partition field (elastic#78875) [Security Solution]Fix basepath used by endpoint telemetry tests (elastic#79027) update rum agent version which contains longtasks (elastic#79105) ...
Resolves #74823
This PR introduces a new out of the box alert for Stack Monitoring that identifies missing periods of monitoring data.
Copy
Firing message
Firing UI message
Screenshots