Progress remaining O11y rule types to FAAD #169867
Pinging @elastic/response-ops (Team:ResponseOps)
This issue isn't prioritized for 8.12, so I added an 8.13 label to have it as a candidate. We can backlog the issue for now.
cc @maryam-saeidi: we're using this issue to track the remaining O11y rules that need to onboard the framework alerts-as-data APIs. It's not likely that we'll have this prioritized in 8.13, but we're more than happy to let someone else drive this (or part of it) with our help.
cc: @vinaychandrasekhar, per our recent discussion about AAD
@mikecote Thanks for pinging me here; my point in the meeting was a suggestion about the possible meaning of that item. I'll ping @paulb-elastic regarding prioritization. |
Sounds good, no specific prioritization ask from us at this time, so it's ok if you don't have capacity 👍 but if you want to pick up the issue, we're more than happy to help!
@maryam-saeidi fyi, we plan to make some progress in this area in 8.14. |
Should we add … ?
@ersin-erdal yes good catch, please add it to the description 🙏 |
@mikecote can you point us to docs / etc where folks can read about what FAAD is? That stands for "Framework Alerts-as-Data", is that right? Thanks! |
@jasonrhodes You can read more about Framework Alerts-as-Data here: https://github.com/elastic/response-ops-team/issues/95. It covers the various phases in which we are unifying the architecture so that the framework provides everything.
Towards: #169867

This PR onboards the Inventory Metric Threshold rule type with FAAD.

## To verify

I used [data-generator](https://github.com/ersin-erdal/data-generator) to generate metric data, then created an Inventory Threshold rule with actions (alert and recovered) and the condition: `For Hosts, When CPU usage is above 10`.

Inventory Threshold uses the following formula to calculate the result: (`system.cpu.user.pct` + `system.cpu.system.pct`) / `system.cpu.cores`

Set `system.cpu.user.pct` = 1, `system.cpu.system.pct` = 1, and `system.cpu.cores` = 4 in [cpu-001](https://github.com/ersin-erdal/data-generator/blob/main/src/indexers/metrics/docs/cpu-001.json). This makes the CPU usage 0.5 (50%) for `host-1`. Run the generator with `./generate metrics`.

Your rule should create an alert and save it in `.internal.alerts-observability.metrics.alerts-default-000001`.

Then set `system.cpu.user.pct` = 0 and `system.cpu.system.pct` = 0. The alert should recover, and the AAD document in the above index should be updated with `kibana.alert.status: recovered`.
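A quick way to check the alert document from Dev Tools is a term query on the status field; a minimal sketch, assuming the default space and the `kibana.alert.status` field mentioned above:

```
GET .internal.alerts-observability.metrics.alerts-default-000001/_search
{
  "query": {
    "term": { "kibana.alert.status": "active" }
  }
}
```

After the recovery step, the same search with `recovered` should return the updated document.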
Towards: #169867

This PR onboards the Log Threshold rule type with FAAD.

### To verify

Create a log threshold rule. Example:

```
POST kbn:/api/alerting/rule
{
  "params": {
    "logView": {
      "logViewId": "Default",
      "type": "log-view-reference"
    },
    "timeSize": 5,
    "timeUnit": "m",
    "count": {
      "value": -1,
      "comparator": "more than"
    },
    "criteria": [
      {
        "field": "log.level",
        "comparator": "equals",
        "value": "error"
      }
    ]
  },
  "consumer": "alerts",
  "schedule": { "interval": "1m" },
  "tags": [],
  "name": "test",
  "rule_type_id": "logs.alert.document.count",
  "notify_when": "onActionGroupChange",
  "actions": []
}
```

Your rule should create an alert and save it in `.internal.alerts-observability.logs.alerts-default-000001`. Example:

```
GET .internal.alerts-*/_search
```

Then set `count.value: 75`. The alert should recover, and the AAD document in the above index should be updated with `kibana.alert.status: recovered`.
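If other O11y rules have written alerts during testing, it can help to filter the wildcard search by rule type; a sketch, using the standard `kibana.alert.rule.rule_type_id` AAD field:

```
GET .internal.alerts-*/_search
{
  "query": {
    "term": { "kibana.alert.rule.rule_type_id": "logs.alert.document.count" }
  }
}
```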
Towards: #169867 This PR onboards "SLO burn rate" rule type with FAAD. ## To verify Create an SLO by using a test index (create a dataview for it), use very low `budget consumed %` The rule bound to the SLO should create an alert and save it under `.internal.alerts-observability.slo.alerts-default-000001`
Towards: #169867 This PR onboards "Custom Threshold" rule type with FAAD. ## To verify Create a Custom Threshold rule by using a test index and DW. Set the `Role visibility` `metrics`. When the rule runs, it generates an alert and saves it under `.internal.alerts-observability.threshold.alerts-default`. The alert should be visible on `Observability > alerts` page as well. --------- Co-authored-by: kibanamachine <42973632+kibanamachine@users.noreply.github.com>
Towards: #169867

This PR onboards the Latency Threshold rule type with FAAD.

### To verify

1. Run the following script to generate APM data:
```
node scripts/synthtrace simple_trace.ts --local --live
```
2. Create a latency threshold rule. Example:
```
POST kbn:/api/alerting/rule
{
  "params": {
    "aggregationType": "avg",
    "environment": "ENVIRONMENT_ALL",
    "threshold": 400,
    "windowSize": 5,
    "windowUnit": "m"
  },
  "consumer": "alerts",
  "schedule": { "interval": "1m" },
  "tags": [],
  "name": "testinggg",
  "rule_type_id": "apm.transaction_duration",
  "notify_when": "onActionGroupChange",
  "actions": []
}
```
3. Your rule should create an alert and save it in `.internal.alerts-observability.apm.alerts-default-000001`. Example:
```
GET .internal.alerts-*/_search
```
4. Set `threshold: 10000`.
5. The alert should be recovered, and the AAD document in the above index should be updated with `kibana.alert.status: recovered`.
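Step 4 can be done from the rule's edit flyout, or over the alerting API; a sketch of the update call, assuming `<rule-id>` is the id returned by the create request in step 2 and that the update endpoint takes the same `params` shape as create:

```
PUT kbn:/api/alerting/rule/<rule-id>
{
  "name": "testinggg",
  "tags": [],
  "schedule": { "interval": "1m" },
  "notify_when": "onActionGroupChange",
  "actions": [],
  "params": {
    "aggregationType": "avg",
    "environment": "ENVIRONMENT_ALL",
    "threshold": 10000,
    "windowSize": 5,
    "windowUnit": "m"
  }
}
```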
Towards: #169867

This PR onboards the Error Count Threshold rule type with FAAD.

### To verify

1. Run the following script to generate APM data:
```
node scripts/synthtrace many_errors.ts --local --live
```
2. Create an error count threshold rule. Example:
```
POST kbn:/api/alerting/rule
{
  "params": {
    "threshold": 25,
    "windowSize": 5,
    "windowUnit": "m",
    "environment": "ENVIRONMENT_ALL"
  },
  "consumer": "alerts",
  "schedule": { "interval": "1m" },
  "tags": [],
  "name": "testinggg",
  "rule_type_id": "apm.error_rate",
  "notify_when": "onActionGroupChange",
  "actions": []
}
```
3. Your rule should create an alert and save it in `.internal.alerts-observability.apm.alerts-default-000001`. Example:
```
GET .internal.alerts-*/_search
```
4. Recover the alert by setting `threshold: 10000`.
5. The alert should be recovered, and the AAD document in the above index should be updated with `kibana.alert.status: recovered`.
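To confirm step 5 took effect, a term query on the recovered status; a minimal sketch against the index named above:

```
GET .internal.alerts-observability.apm.alerts-default-000001/_search
{
  "query": {
    "term": { "kibana.alert.status": "recovered" }
  }
}
```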
Towards: #169867

This PR onboards the APM Anomaly rule type with FAAD.

I am having trouble getting this rule to create an alert. If there is an easy way to verify, please let me know!
Towards: #169867

This PR onboards the Transaction Error Rate rule type with FAAD.

### To verify

1. Run the following script to generate APM data:
```
node scripts/synthtrace many_errors.ts --local --live
```
2. Create a transaction error rate rule. Example:
```
POST kbn:/api/alerting/rule
{
  "params": {
    "threshold": 0,
    "windowSize": 5,
    "windowUnit": "m",
    "environment": "ENVIRONMENT_ALL"
  },
  "consumer": "alerts",
  "schedule": { "interval": "1m" },
  "tags": [],
  "name": "test",
  "rule_type_id": "apm.transaction_error_rate",
  "notify_when": "onActionGroupChange",
  "actions": []
}
```
3. Your rule should create an alert and save it in `.internal.alerts-observability.apm.alerts-default-000001`. Example:
```
GET .internal.alerts-*/_search
```
4. Recover the alert by setting `threshold: 200`.
5. The alert should be recovered, and the AAD document in the above index should be updated with `kibana.alert.status: recovered`.
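All three APM rule types above write to the same alerts index, so filtering by rule type keeps the results unambiguous; a sketch using the standard `kibana.alert.rule.rule_type_id` field:

```
GET .internal.alerts-observability.apm.alerts-default-000001/_search
{
  "query": {
    "term": { "kibana.alert.rule.rule_type_id": "apm.transaction_error_rate" }
  }
}
```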
Towards: #169867

This PR onboards the Uptime rule types (TLS, Duration Anomaly, and Monitor Status) with FAAD.

We are deprecating the rule-registry plugin and onboarding the rule types with the new alertsClient to manage alerts-as-data. There is no new feature: all the rule types should work as they did before and save alerts with all the existing fields.

## To verify:

- Switch to Kibana 8.9.0 in your local repo. (In this version the Uptime rules are not deprecated.)
- Run your ES with: `yarn es snapshot -E path.data=../local-es-data`
- Run your Kibana.
- Create Uptime rules with an active and a recovered action. (You can run Heartbeat locally if needed; [follow the instructions](https://www.elastic.co/guide/en/beats/heartbeat/current/heartbeat-installation-configuration.html).)
- Stop your ES and Kibana.
- Switch to this branch and run your ES with `yarn es snapshot -E path.data=../local-es-data` again.
- Run your Kibana.
- Modify the Uptime rule type code to force it to create an alert. For example, mock [availabilityResults in status_check](https://github.com/elastic/kibana/blob/main/x-pack/plugins/observability_solution/uptime/server/legacy_uptime/lib/alerts/status_check.ts#L491) with the data below:
```
availabilityResults = [
  {
    monitorId: '1',
    up: 1,
    down: 0,
    location: 'location',
    availabilityRatio: 0.5,
    monitorInfo: {
      timestamp: '',
      monitor: {
        id: '1',
        status: 'down',
        type: 'type',
        check_group: 'default',
      },
      docId: 'docid',
    },
  },
];
```

It should create an alert. The alert should be saved under the `.alerts-observability.uptime.alerts` index and be visible on the Observability Alerts page.

Then remove the mock; the alert should be recovered.
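To double-check the alias named above from Dev Tools, a minimal sketch assuming the default space:

```
GET .alerts-observability.uptime.alerts*/_search
{
  "query": {
    "term": { "kibana.alert.status": "active" }
  }
}
```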
# Backport

This will backport the following commits from `main` to `8.14`:
- [Onboard Uptime rule types with FAAD (#179493)](#179493)

### Questions?
Please refer to the [Backport tool documentation](https://github.com/sqren/backport)

Co-authored-by: Ersin Erdal <92688503+ersin-erdal@users.noreply.github.com>
Towards: #169867

This PR onboards the Synthetics Monitor Status rule type with FAAD.

### To verify

I can't get the rule to alert, so I modified the status check to report the monitor as down. If you know of an easier way, please let me know 🙂

1. Create a [monitor](http://localhost:5601/app/synthetics/monitors); by default, creating a monitor creates a rule.
2. Click on the monitor and grab the id and locationId from the URL.
3. Go to [the status check code](https://github.com/elastic/kibana/blob/main/x-pack/plugins/observability_solution/synthetics/server/queries/query_monitor_status.ts#L208) and replace the object that is returned with the following, using the id and locationId you got from the monitor:
```
{
  up: 0,
  down: 1,
  pending: 0,
  upConfigs: {},
  pendingConfigs: {},
  downConfigs: {
    '${id}-${locationId}': {
      configId: '${id}',
      monitorQueryId: '${id}',
      status: 'down',
      locationId: '${locationId}',
      ping: {
        '@timestamp': new Date().toISOString(),
        state: { id: 'test-state' },
        monitor: { name: 'test-monitor' },
        observer: { name: 'test-monitor' },
      } as any,
      timestamp: new Date().toISOString(),
    },
  },
  enabledMonitorQueryIds: ['${id}'],
};
```
4. Your rule should create an alert and save it in `.internal.alerts-observability.uptime.alerts-default-000001`. Example:
```
GET .internal.alerts-*/_search
```
5. Recover the alert by repeating step 3 with:
```
{
  up: 1,
  down: 0,
  pending: 0,
  downConfigs: {},
  pendingConfigs: {},
  upConfigs: {
    '${id}-${locationId}': {
      configId: '${id}',
      monitorQueryId: '${id}',
      status: 'down',
      locationId: '${locationId}',
      ping: {
        '@timestamp': new Date().toISOString(),
        state: { id: 'test-state' },
        monitor: { name: 'test-monitor' },
        observer: { name: 'test-monitor' },
      } as any,
      timestamp: new Date().toISOString(),
    },
  },
  enabledMonitorQueryIds: ['${id}'],
};
```
6. The alert should be recovered, and the AAD document in the above index should be updated with `kibana.alert.status: recovered`.
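Since legacy Uptime alerts share this index, filtering on the rule type id can help isolate the Synthetics alert; a sketch, assuming the Synthetics monitor status rule type id is `xpack.synthetics.alerts.monitorStatus`:

```
GET .internal.alerts-observability.uptime.alerts-*/_search
{
  "query": {
    "term": { "kibana.alert.rule.rule_type_id": "xpack.synthetics.alerts.monitorStatus" }
  }
}
```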
Now that we've successfully onboarded our first O11y rule type to use FAAD (#164220), we should start onboarding the remaining rule types as well (a rough sketch of the common onboarding pattern follows the list below).
The list of rule types includes:
APM
Infra
Logs
SLO
Uptime
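For anyone picking up one of these rule types, the PRs linked above all follow the same shape: drop the rule-registry alert factory and use the framework's alerts client to report active alerts and update recovered ones. Below is a rough TypeScript sketch of that executor pattern; the interfaces are simplified stand-ins for the real Kibana types, and the action group id and payload fields are illustrative assumptions, not the exact API:

```
// Simplified stand-ins for the Kibana alerting framework types (illustrative only).
interface AlertsClientLike {
  report(alert: { id: string; actionGroup: string; payload?: Record<string, unknown> }): void;
  getRecoveredAlerts(): Array<{ alert: { getId(): string } }>;
  setAlertData(data: { id: string; payload: Record<string, unknown> }): void;
}

interface ExecutorServices {
  alertsClient: AlertsClientLike;
}

// Executor body: report breached monitors and update alerts the framework marked recovered.
function runUptimeLikeExecutor(
  services: ExecutorServices,
  breachedMonitors: Array<{ monitorId: string; reason: string }>
) {
  const { alertsClient } = services;

  // One alert per breached monitor; the payload becomes fields on the alert-as-data doc.
  for (const monitor of breachedMonitors) {
    alertsClient.report({
      id: monitor.monitorId,
      actionGroup: 'monitorStatus', // assumption: example action group id
      payload: { 'kibana.alert.reason': monitor.reason },
    });
  }

  // Alerts that stopped breaching are handed back by the framework as recovered.
  for (const { alert } of alertsClient.getRecoveredAlerts()) {
    alertsClient.setAlertData({
      id: alert.getId(),
      payload: { 'kibana.alert.reason': 'Recovered' },
    });
  }
}
```

The framework then owns index creation, mappings, and the alert documents themselves, which is what lets each rule type delete its rule-registry plumbing.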
Definition of Done