[Feat]: When defining an Alert Silencing Rule I should be able to filter down till the alert instance (chart name) #990

hugovalente-pm · 2024-03-20T16:46:40Z

Problem

Take the case of a disk.space alert on multiple mount points: if I want to silence the alert for one specific mount point, I'm not able to do it with the currently available attributes:

Alert name
Alert context (chart context)

Description

To give finer-grained control over silencing specific alert instances, be they mount points, network devices, or even database instances, Netdata should provide that level of flexibility.

Importance

really want

Value proposition

Provide more flexibility on the Alert Notification Silencing Rules

Proposed implementation

Two options were considered:

1. Use the currently available attributes of an alert: chart name (for display) / chart id (stored in the DB)
   - when the user starts from an Active Alert, this should be pre-filled immediately
   - when the user starts from a blank rule, the new attribute should only become available once the user fills in either "alert name" or "alert context", so that we can provide a pre-filtered list of "alert instances" (name TBC)
2. Rely on chart labels, as the current alert definitions do (see the Learn documentation), to let the user specify how to ensure that specific alerts on given chart(s) are silenced
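
To make option 1 concrete, here is a minimal sketch of how a silencing rule with an extra instance attribute might be matched against an alert. The `SilencingRule` and `Alert` types, their field names, and the `disk_space._data` chart id are hypothetical and chosen for illustration only; they are not the actual Netdata Cloud schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Alert:
    name: str       # e.g. "disk_space_usage"
    context: str    # e.g. "disk.space"
    chart_id: str   # identifies the alert instance


@dataclass(frozen=True)
class SilencingRule:
    # None means "do not filter on this attribute" (hypothetical convention).
    alert_name: Optional[str] = None
    alert_context: Optional[str] = None
    chart_id: Optional[str] = None

    def matches(self, alert: Alert) -> bool:
        """Return True when every attribute set on the rule equals the alert's."""
        return (
            (self.alert_name is None or self.alert_name == alert.name)
            and (self.alert_context is None or self.alert_context == alert.context)
            and (self.chart_id is None or self.chart_id == alert.chart_id)
        )


root = Alert("disk_space_usage", "disk.space", "disk_space._")
data = Alert("disk_space_usage", "disk.space", "disk_space._data")  # hypothetical chart id

rule = SilencingRule(alert_context="disk.space", chart_id="disk_space._data")
print(rule.matches(root), rule.matches(data))
```

With the instance attribute set, the rule silences only the second alert; without it, both alerts on the disk.space context would match.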


hugovalente-pm · 2024-03-20T16:47:54Z

@car12o I wrote up two proposed solutions based on what we discussed in the daily. I know that for option 2 you had to check something before we know whether it is a way forward.
Could you update this ticket with your findings when you are able to?

car12o · 2024-03-20T20:05:40Z

I can confirm that we don't have chart labels on the alert transition, but we do have them on the alert config, although I don't know if that's what you expect.
Here are some alert config examples:

template                           |chart                                                         |component    |units                |info                                                                                                                          |summary                                        |host_labels      |chart_labels                                           |
-----------------------------------+--------------------------------------------------------------+-------------+---------------------+------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------+-----------------+-------------------------------------------------------+
mdstat_mismatch_cnt                |md.mismatch_cnt                                               |RAID         |unsynchronized blocks|number of unsynchronized blocks for the ${label:device} ${label:raid_level} array                                             |                                               |                 |raid_level=!raid1 !raid10 *                            |
disk_space_usage                   |disk.space                                                    |Disk         |%                    |Total space utilization of disk ${label:mount_point}                                                                          |Disk ${label:mount_point} space usage          |_os=linux freebsd|mount_point=!/dev !/dev/* !/run !/run/* *              |
postgresql_pg_wall_disk_space_usage|disk.space                                                    |PostgreSQL   |%                    |The percentage of Disk Space being used by the pg_wall.                                                                       |Disk ${label:mount_point} (pg_wall) space usage|_os=linux freebsd|mount_point=/media/pgdata_adto                         |
DLE_CAS_sync_instance_lag          |DLE_CAS.sync_instance_lag                                     |sync_instance|seconds              |DLE_CAS Sync instance - high lag of WAL replay. Time stamp of last transaction replayed during recovery exceeds the threshold.|                                               |                 |_collect_module=DLE_CAS _collect_plugin=charts.d.plugin|
disk_space_usage                   |disk.space                                                    |Disk         |%                    |Total space utilization of disk ${label:mount_point}                                                                          |Disk ${label:mount_point} space usage          |_os=linux freebsd|mount_point=!/dev !/dev/* !/run !/run/* *              |
disk_inode_usage                   |disk.inodes                                                   |Disk         |%                    |disk ${label:mount_point} inode utilization                                                                                   |                                               |_os=linux freebsd|mount_point=!/dev !/dev/* !/run !/run/* *              |
disk_space_usage                   |disk.space                                                    |Disk         |%                    |Total space utilization of disk ${label:mount_point}                                                                          |Disk ${label:mount_point} space usage          |_os=linux freebsd|mount_point=!/dev !/dev/* !/run !/run/* *              |
rds_freeable_memory_alert          |prometheus.cloudwatch_exporter.aws_rds_freeable_memory_average|             |MB                   |AWS RDS instance freeable memory                                                                                              |                                               |                 |dbinstance_identifier= !koi-nonprod-infra-mysql *      |
disk_space_usage                   |disk.space                                                    |Disk         |%                    |Total space utilization of disk ${label:mount_point}                                                                          |Disk ${label:mount_point} space usage          |_os=linux freebsd|mount_point=!/dev !/dev/* !/run !/run/* *              |
disk_inode_usage                   |disk.inodes                                                   |Disk         |%                    |Total inode utilization of disk ${label:mount_point}                                                                          |Disk ${label:mount_point} inode usage          |_os=linux freebsd|mount_point=!/dev !/dev/* !/run !/run/* *              |

bear in mind that we do have some configs without any chart labels.

nevertheless, I think it's easier and more straightforward to filter by alert instance (chart_name/chart_id)
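
For reference, the chart_labels values in the table above look like Netdata simple patterns. The sketch below shows how such a value could be evaluated against a label, assuming the usual reading of that syntax: space-separated patterns, `*` as the only wildcard, `!` for a negative match, and the first matching pattern deciding the result. The function name `simple_pattern_match` is hypothetical, and this is not the agent's implementation.

```python
import re


def simple_pattern_match(pattern: str, value: str) -> bool:
    """Evaluate a space-separated pattern list left to right; first match wins."""
    for token in pattern.split():
        negative = token.startswith("!")
        glob = token[1:] if negative else token
        # Only "*" is treated as a wildcard here; everything else is literal.
        regex = ".*".join(re.escape(part) for part in glob.split("*"))
        if re.fullmatch(regex, value):
            return not negative
    return False  # no pattern matched


mount_points = "!/dev !/dev/* !/run !/run/* *"
for mp in ["/", "/dev/shm", "/run/user/1000", "/media/pgdata_adto"]:
    print(mp, simple_pattern_match(mount_points, mp))
```

Under these assumptions, `/` and `/media/pgdata_adto` match while `/dev/shm` and `/run/user/1000` are excluded, which is why one disk_space_usage config covers many mount points and cannot be narrowed to a single one by alert name or context alone.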

hugovalente-pm · 2024-03-21T11:19:49Z

> bear in mind that we do have some configs without any chart labels.

I think this is probably because of older agent versions, where we weren't using labels

> nevertheless, I think it's easier and more straightforward to filter by alert instance (chart_name/chart_id)

I agree it is easier for now. The discussion was about whether it would make sense to move toward a more ideal solution that relies on labels, since that is also how we set this in alert definitions.
I'm OK to progress with the alert instance, and we can revisit this later.

@kapantzak from your side all good?

car12o · 2024-03-21T11:35:02Z

> I think this is probably because of older version agents where we weren't using labels

I don't think that's the case: I sorted by creation timestamp and still got some configs with empty chart labels.

kapantzak · 2024-03-21T13:47:30Z

@hugovalente-pm using the alert instance seems easier and more straightforward to me too.

However, I'm not sure I have this information at that point. I see that I get contexts, names and roles from this endpoint: api/v2/spaces/{spaceID}/alarms/metas, but how do I get the instance?
@car12o

car12o · 2024-03-21T16:16:22Z

@kapantzak here's how to get the data, let me know if something is not clear

Get instances from alert name or context

POST /api/v2/spaces/{spaceID}/rooms/{roomID}/alerts
body:

{
  "options": ["instances"],
  "scope": {
    "nodes": ["{nodeID}"], // if you want to filter by node
    "contexts": ["{context}"] // if you want to get instances by context (e.g. disk.space)
  },
  "selectors": {
    "alert": ["{alert_name}"] // if you want to get instances by alert name (e.g. disk_space_usage)
  }
}

All of these parameters are optional, but as we discussed, to narrow down the list of possible instances we should always specify either contexts or alert.
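
A minimal sketch of building that request body on the client side, enforcing the "contexts or alert" rule before the call is made. The helper name `build_instances_request` and its parameter names are hypothetical; only the JSON keys come from the body shown above.

```python
import json
from typing import Optional


def build_instances_request(
    context: Optional[str] = None,
    alert_name: Optional[str] = None,
    node_id: Optional[str] = None,
) -> dict:
    """Build the POST body, requiring a context or an alert name."""
    if not context and not alert_name:
        raise ValueError("specify at least one of context or alert_name")

    body: dict = {"options": ["instances"]}
    scope = {}
    if node_id:
        scope["nodes"] = [node_id]        # optional: restrict to one node
    if context:
        scope["contexts"] = [context]     # e.g. "disk.space"
    if scope:
        body["scope"] = scope
    if alert_name:
        body["selectors"] = {"alert": [alert_name]}  # e.g. "disk_space_usage"
    return body


print(json.dumps(build_instances_request(context="disk.space"), indent=2))
```

Omitting empty keys keeps the body close to the example; whether the endpoint also accepts empty arrays is not something this thread confirms.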

What identifies an alert instance is the chart. The response looks like this:

{
  "api": 2,
  "nodes": [
    // ...
  ],
  "alert_instances": [
    {
      "ni": 2,
      "ati": null,
      "sum": "Disk / space usage",
      "info": "Total space utilization of disk /",
      "nm": "disk_space_usage",
      "ch": "disk_space._", // chart_id - this should be the field used when posting a rule
      "ch_n": "disk_space._", // chart_name - this should be the field used to display on the UI (friendly name)
      "ctx": "disk.space",
      "st": "CLEAR",
      "v": 0,
      "t": 0,
      "tr_i": "b047bf45-f831-49c8-b8de-6d76f5712858",
      "tr_v": 9.579043377550992,
      "tr_t": 1710857916,
      "units": "%",
      "cfg": "13038942-685d-4c69-9431-5d8877db1f80",
      "src": "line=10,file=/usr/lib/netdata/conf.d/health.d/disks.conf",
      "exec": "/usr/libexec/netdata/plugins.d/alarm-notify.sh",
      "tp": "System",
      "cl": "Utilization",
      "cm": "Disk",
      "to": "sysadmin",
      "slc": {
        "state": "NONE"
      }
    }
  ]
}
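
As a rough illustration of how the UI could consume this, the sketch below turns a response into dropdown options, using `ch` as the value to post and `ch_n` as the label to display. The function name `instance_options`, the `value`/`label` shape, and the de-duplication step (on the assumption that a chart id may appear more than once) are hypothetical.

```python
import json

# Trimmed copy of the response above, keeping only the fields used here.
response_text = """
{
  "api": 2,
  "alert_instances": [
    {"nm": "disk_space_usage", "ctx": "disk.space",
     "ch": "disk_space._", "ch_n": "disk_space._"}
  ]
}
"""


def instance_options(response: dict) -> list[dict]:
    """Return unique {value, label} pairs: chart_id to post, chart_name to show."""
    seen = set()
    options = []
    for inst in response.get("alert_instances", []):
        chart_id = inst["ch"]
        if chart_id in seen:
            continue  # assumption: skip repeated chart ids
        seen.add(chart_id)
        options.append({"value": chart_id, "label": inst.get("ch_n", chart_id)})
    return options


print(instance_options(json.loads(response_text)))
```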

hugovalente-pm · 2024-04-19T10:00:35Z

this is released

hugovalente-pm added the needs triage, feature request, cloud-frontend, cloud-backend, and area/alert-notifications labels and removed the needs triage label Mar 20, 2024

hugovalente-pm added this to the Immediate backlog milestone Apr 4, 2024

hugovalente-pm added the initiative label Apr 4, 2024

hugovalente-pm assigned kapantzak Apr 11, 2024

hugovalente-pm modified the milestones: Immediate backlog, 2024 Q1 Now Apr 11, 2024

hugovalente-pm assigned car12o Apr 11, 2024

hugovalente-pm closed this as completed Apr 19, 2024
