-
Notifications
You must be signed in to change notification settings - Fork 8.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Monitoring] Disk usage alerting #75419
Conversation
Pinging @elastic/stack-monitoring (Team:Monitoring) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, nice work so far!
Found a couple of things right away that's making it hard to continue to test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added another comment about the next steps and hopefully get feedback from Ravi
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should also add the DISK alert to this list: https://github.com/elastic/kibana/blob/master/x-pack/plugins/monitoring/public/views/elasticsearch/nodes/index.js#L86
x-pack/plugins/monitoring/server/lib/alerts/fetch_disk_usage_node_stats.ts
Show resolved
Hide resolved
Also, @hbharding came up with some changes to the panel that I think will make it look better. WDYT? |
This seems like it should be separate issue, maybe? Or, I could try doing it here, just feel like there'll be some back n forth if there isn't the how it should look picture |
Yes, good point! Let's tackle it separately |
createLink( | ||
'xpack.monitoring.alerts.diskUsage.ui.nextSteps.resizeYourDeployment', | ||
'Resize your deployment (ECE)', | ||
`{elasticWebsiteUrl}/guide/en/cloud-enterprise/{docLinkVersion}/ece-resize-deployment.html` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think cloud docs work the same way. The correct url seems to be https://www.elastic.co/guide/en/cloud-enterprise/current/ece-resize-deployment.html whereas this code generates https://www.elastic.co/guide/en/cloud-enterprise/master/ece-resize-deployment.html
}, [] as string[]); | ||
firingNodeUuids.sort(); // It doesn't matter how we sort, but keep the order consistent | ||
const instanceId = `${this.type}:${cluster.clusterUuid}:${firingNodeUuids.join(',')}`; | ||
const instanceId = `.monitoring:${this.type}:${cluster.clusterUuid}`; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While this does help with maintaining instance state for after resolving an alert, it introduces another problem. The unique instance id is where the throttling is enforced - meaning if you create an instance with a previously used id, the actions are subject to the throttle period started by the first time you used that instance.
In this case, since the instance id is based off the cluster id, this will never change from cluster to cluster, even if the number of nodes that are firing changes.
Imagine a 3 node cluster (A, B, C) and node A is firing an alert, this instance id will be .monitoring:cpu_usage:clusterUuid
and the actions will fire and then the throttling will start for all .monitoring:cpu_usage:clusterUuid
instances. Now, imagine node A resolves itself, but node B starts firing an alert. This will run and try and fire the actions, but it will be subject to the throttle period (which by default is 1d
) so they wouldn't see any messaging about it.
This is why I originally did it by firingNodeUuids
to ensure we generated a unique instance id based on what was actually firing.
I'm not sure we can go in this direction because I worry that our alerting will miss valid cases where it should send actions and we will lose trust with our users.
WDYT?
oldState.ui.isFiring !== newAlertState.ui.isFiring && | ||
oldState.ui.resolvedMS !== newAlertState.ui.resolvedMS | ||
); | ||
if (!relatedOldState) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I understand this logic here.
If the the old state is the same node as the new state, but the isFiring
flipped and resolvedMS
is different, that means we need to fire a resolution? I'm not sure we even need to look at resolvedMS
. If the isFiring
went from true
to false
, then we need to fire resolved actions.
Or am I just not reading this properly?
} | ||
|
||
if (deltaFiringStates.length) { | ||
const instance = services.alertInstanceFactory(`${deltaInstanceIdPrefix}:firing`); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this idea a lot - We can just execute actions off a unique instance id for firing and resolved
} | ||
} | ||
|
||
protected async processData( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if we did something like this for processData
to handle the resolutions?
https://gist.github.com/chrisronline/8cc094cd1876e895746d5e91db84be7c
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This working properly actually uncovers a UX issue in the UI where we don't really surface this well. We should really defer this to a separate PR IMO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good! A couple of things I noticed
💚 Build SucceededMetrics [docs]@kbn/optimizer bundle module count
async chunks size
distributable file count
page load bundle size
History
To update your PR or re-run it, just comment with: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Awesome job!
Backport: |
* master: (97 commits) [Actions] Adds a "Test Connector" button on the Connectors List to make discovery of the Test tab easier (elastic#78746) [Discover] Fix functional time picker test permissions (elastic#78564) [ML] Fixing module datafeed overrides (elastic#78925) Adds some missing licenses to the CSV export (elastic#78719) [dev/cli] ensure plugins/ and all watch source dirs exist (elastic#78973) [Lens] Stop using scripted metric to collect telemetry (elastic#78687) [Lens] fix wrong message in fields accordion (elastic#78924) [Enterprise Search][App Search] Credentials Logic updates (elastic#78644) [Monitoring] Disk usage alerting (elastic#75419) [SECURITY_SOLUTION] Trusted apps list expand/collapse details (elastic#78601) Update content on interstitial page (elastic#78881) chore(NA): include hjson as a prod dependency (elastic#78941) Fix empty meta fields input in Advanced Settings (elastic#78576) [Lens] Maintain order of operations in dimension panel (elastic#78864) Fix plugin doc title (elastic#78880) load apm-rum agent lazily (elastic#78760) [ML] Skip full ML access permission test Optimize charts plugin (elastic#78922) ui_actions service initial docs (elastic#78902) skip failing suite (elastic#78942) ...
Resolves #74819
This is part of the "Additional Alerting" effort for Stack Monitoring
The check calculates each data node to make sure the disk usage is below the implied threshold.
Testing: