[Feat]: When defining an Alert Silencing Rule I should be able to filter down till the alert instance (chart name) #990

Closed
hugovalente-pm opened this issue Mar 20, 2024 · 7 comments

Comments

@hugovalente-pm
Contributor

Problem

Taking the case of a disk.space alert on multiple Mount Points: if I want to silence the alert for one specific Mount Point, I'm not able to do it with the attributes currently available:

  • Alert name
  • Alert context (chart context)

Description

To allow finer-grained control over silencing specific alert instances, be they Mount Points, Network Devices, or even Database Instances, Netdata should provide that level of flexibility.

Importance

really want

Value proposition

  1. Provide more flexibility on the Alert Notification Silencing Rules

Proposed implementation

There are two considered options:

  1. Use the attributes currently available on an alert: chart name (for display) / chart id (stored in the DB)

    • when the user starts from an Active Alert, this should be pre-filled immediately
    • when the user starts from a blank rule, the new attribute should only become available once the user fills in either "alert name" or "alert context", so we can provide a pre-filtered list of "alert instances" (name TBC)
  2. Rely on chart labels, as the current alert definitions do (check learn here), to let the user specify which alerts on given chart(s) are silenced
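
To make option 1 concrete, here is a minimal sketch of how a rule with an optional instance attribute could be matched against an alert. The class names, the field names, and the chart id "disk_space._data" are hypothetical and only for illustration; this is not how the silencing service is implemented.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Alert:
    name: str      # e.g. "disk_space_usage"
    context: str   # e.g. "disk.space"
    chart_id: str  # e.g. "disk_space._"


@dataclass(frozen=True)
class SilencingRule:
    # None means "do not filter on this attribute" (assumed convention).
    alert_name: Optional[str] = None
    alert_context: Optional[str] = None
    chart_id: Optional[str] = None  # the proposed new attribute

    def matches(self, alert: Alert) -> bool:
        # Every attribute that is set on the rule must equal the alert's value.
        checks = (
            (self.alert_name, alert.name),
            (self.alert_context, alert.context),
            (self.chart_id, alert.chart_id),
        )
        return all(want is None or want == got for want, got in checks)


root = Alert("disk_space_usage", "disk.space", "disk_space._")
data = Alert("disk_space_usage", "disk.space", "disk_space._data")  # made-up chart id

by_context = SilencingRule(alert_context="disk.space")
by_instance = SilencingRule(alert_context="disk.space", chart_id="disk_space._data")
```

With these definitions, by_context silences both mount points, while by_instance only silences the alert on the second chart.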

@hugovalente-pm
Contributor Author

@car12o I wrote up two proposed solutions based on what we discussed in the daily. I know that for option 2 you had to check something before we know whether it is a way forward.
Do you think you could update this ticket with your findings when you are able to?

@car12o

car12o commented Mar 20, 2024

I can confirm that we don't have chart labels on the alert transition, but we do have them on the alert config, although I don't know if that's what you expect.
Here are some alert config examples:

template                           |chart                                                         |component    |units                |info                                                                                                                          |summary                                        |host_labels      |chart_labels                                           |
-----------------------------------+--------------------------------------------------------------+-------------+---------------------+------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------+-----------------+-------------------------------------------------------+
mdstat_mismatch_cnt                |md.mismatch_cnt                                               |RAID         |unsynchronized blocks|number of unsynchronized blocks for the ${label:device} ${label:raid_level} array                                             |                                               |                 |raid_level=!raid1 !raid10 *                            |
disk_space_usage                   |disk.space                                                    |Disk         |%                    |Total space utilization of disk ${label:mount_point}                                                                          |Disk ${label:mount_point} space usage          |_os=linux freebsd|mount_point=!/dev !/dev/* !/run !/run/* *              |
postgresql_pg_wall_disk_space_usage|disk.space                                                    |PostgreSQL   |%                    |The percentage of Disk Space being used by the pg_wall.                                                                       |Disk ${label:mount_point} (pg_wall) space usage|_os=linux freebsd|mount_point=/media/pgdata_adto                         |
DLE_CAS_sync_instance_lag          |DLE_CAS.sync_instance_lag                                     |sync_instance|seconds              |DLE_CAS Sync instance - high lag of WAL replay. Time stamp of last transaction replayed during recovery exceeds the threshold.|                                               |                 |_collect_module=DLE_CAS _collect_plugin=charts.d.plugin|
disk_space_usage                   |disk.space                                                    |Disk         |%                    |Total space utilization of disk ${label:mount_point}                                                                          |Disk ${label:mount_point} space usage          |_os=linux freebsd|mount_point=!/dev !/dev/* !/run !/run/* *              |
disk_inode_usage                   |disk.inodes                                                   |Disk         |%                    |disk ${label:mount_point} inode utilization                                                                                   |                                               |_os=linux freebsd|mount_point=!/dev !/dev/* !/run !/run/* *              |
disk_space_usage                   |disk.space                                                    |Disk         |%                    |Total space utilization of disk ${label:mount_point}                                                                          |Disk ${label:mount_point} space usage          |_os=linux freebsd|mount_point=!/dev !/dev/* !/run !/run/* *              |
rds_freeable_memory_alert          |prometheus.cloudwatch_exporter.aws_rds_freeable_memory_average|             |MB                   |AWS RDS instance freeable memory                                                                                              |                                               |                 |dbinstance_identifier= !koi-nonprod-infra-mysql *      |
disk_space_usage                   |disk.space                                                    |Disk         |%                    |Total space utilization of disk ${label:mount_point}                                                                          |Disk ${label:mount_point} space usage          |_os=linux freebsd|mount_point=!/dev !/dev/* !/run !/run/* *              |
disk_inode_usage                   |disk.inodes                                                   |Disk         |%                    |Total inode utilization of disk ${label:mount_point}                                                                          |Disk ${label:mount_point} inode usage          |_os=linux freebsd|mount_point=!/dev !/dev/* !/run !/run/* *              |

bear in mind that we do have some configs without any chart labels.

nevertheless, I think it's easier and more straightforward to filter by alert instance (chart_name/chart_id)
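
For reference, the chart_labels values in the table above are Netdata simple patterns: space-separated terms, where "*" is a wildcard, a leading "!" makes a term negative, and the first term that matches decides. A rough sketch of how a label-based silencing filter could evaluate them follows. It assumes Python's fnmatch as the glob matcher, which also treats "?" and "[" as special, unlike Netdata, so it is an approximation rather than the agent's implementation.

```python
from fnmatch import fnmatchcase


def simple_pattern_matches(pattern: str, value: str) -> bool:
    """Return True if `value` is accepted by a space-separated simple pattern.

    Terms are tried left to right; the first term that matches decides,
    and a leading "!" turns that term into a rejection.
    """
    for term in pattern.split():
        negative = term.startswith("!")
        glob = term[1:] if negative else term
        if fnmatchcase(value, glob):
            return not negative
    return False  # no term matched at all


# Pattern taken from the disk_space_usage rows above.
MOUNT_POINT_PATTERN = "!/dev !/dev/* !/run !/run/* *"

for mount_point in ("/", "/dev/shm", "/run", "/media/pgdata_adto"):
    print(mount_point, simple_pattern_matches(MOUNT_POINT_PATTERN, mount_point))
```

Configs with an empty chart_labels column would need a separate rule (for example, "no pattern means match everything"), which this sketch does not decide.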

@hugovalente-pm
Contributor Author

> bear in mind that we do have some configs without any chart labels.

I think this is probably because of older agent versions where we weren't using labels

> nevertheless, I think it's easier and more straightforward to filter by alert instance (chart_name/chart_id)

I agree it is easier for now. The discussion was about whether it would make sense to move towards a more ideal solution relying on labels, since that is also how we set this in alert definitions.
I'm OK to progress with the alert instance and we can revisit this later

@kapantzak from your side all good?

@car12o

car12o commented Mar 21, 2024

> I think this is probably because of older agent versions where we weren't using labels

I don't think that's the case: I sorted by created timestamp and still got some recent configs with empty chart labels.

@kapantzak

@hugovalente-pm using the alert instance seems easier and more straightforward to me too.

However, I'm not sure I have this information at that point. I see that I get contexts, names and roles from this endpoint: api/v2/spaces/{spaceID}/alarms/metas, but how do I get the instance?
@car12o

@car12o

car12o commented Mar 21, 2024

@kapantzak here's how to get the data, let me know if something is not clear

Get instances from alert name or context

POST /api/v2/spaces/{spaceID}/rooms/{roomID}/alerts
body:

{
  "options": ["instances"],
  "scope": {
    "nodes": ["{nodeID}"], // if you want to filter by node
    "contexts": ["{context}"] // if you want to get instances by context (e.g. disk.space)
  },
  "selectors": {
    "alert": ["{alert_name}"] // if you want to get instances by alert name (e.g. disk_space_usage)
  }
}

All these parameters are optional but, as we discussed, to narrow down the list of possible instances we should always specify either contexts or alert.
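
As a rough illustration of that rule, here is a sketch of a helper that builds the path and body for this request and refuses to build one without a context or an alert name. The function name and its parameters are hypothetical, and leaving out empty keys is an assumption based on all parameters being optional.

```python
import json
from typing import Any, Dict, Optional, Sequence, Tuple


def build_instances_request(
    space_id: str,
    room_id: str,
    *,
    contexts: Optional[Sequence[str]] = None,
    alert_names: Optional[Sequence[str]] = None,
    node_ids: Optional[Sequence[str]] = None,
) -> Tuple[str, Dict[str, Any]]:
    """Return (path, body) for the alert instances request described above."""
    if not contexts and not alert_names:
        # Without one of these the instance list would not be pre-filtered.
        raise ValueError("specify at least one context or alert name")

    scope: Dict[str, Any] = {}
    if node_ids:
        scope["nodes"] = list(node_ids)
    if contexts:
        scope["contexts"] = list(contexts)

    body: Dict[str, Any] = {"options": ["instances"]}
    if scope:
        body["scope"] = scope
    if alert_names:
        body["selectors"] = {"alert": list(alert_names)}

    path = f"/api/v2/spaces/{space_id}/rooms/{room_id}/alerts"
    return path, body


# "space-1" and "room-1" are placeholder ids.
path, body = build_instances_request("space-1", "room-1", contexts=["disk.space"])
print("POST", path)
print(json.dumps(body, indent=2))
```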

What identifies an alert instance is the chart. The response looks like this:

{
  "api": 2,
  "nodes": [
    // ...
  ],
  "alert_instances": [
    {
      "ni": 2,
      "ati": null,
      "sum": "Disk / space usage",
      "info": "Total space utilization of disk /",
      "nm": "disk_space_usage",
      "ch": "disk_space._", // chart_id - this should be the field used when posting a rule
      "ch_n": "disk_space._", // chart_name - this should be the field used to display on the UI (friendly name)
      "ctx": "disk.space",
      "st": "CLEAR",
      "v": 0,
      "t": 0,
      "tr_i": "b047bf45-f831-49c8-b8de-6d76f5712858",
      "tr_v": 9.579043377550992,
      "tr_t": 1710857916,
      "units": "%",
      "cfg": "13038942-685d-4c69-9431-5d8877db1f80",
      "src": "line=10,file=/usr/lib/netdata/conf.d/health.d/disks.conf",
      "exec": "/usr/libexec/netdata/plugins.d/alarm-notify.sh",
      "tp": "System",
      "cl": "Utilization",
      "cm": "Disk",
      "to": "sysadmin",
      "slc": {
        "state": "NONE"
      }
    }
  ]
}
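
On the UI side, a response like this could be reduced to a list of selectable instances, keeping "ch" as the value to post and "ch_n" as the label to display. The sketch below assumes only the "alert_instances", "ch" and "ch_n" fields shown above; the function name and the trimmed sample data are made up for illustration.

```python
from typing import Any, Dict, List, Tuple


def instance_options(response: Dict[str, Any]) -> List[Tuple[str, str]]:
    """Return unique (chart_id, chart_name) pairs in first-seen order."""
    seen: Dict[str, str] = {}
    for instance in response.get("alert_instances", []):
        chart_id = instance["ch"]
        # Fall back to the chart id if the friendly name is missing or empty.
        seen.setdefault(chart_id, instance.get("ch_n") or chart_id)
    return list(seen.items())


# Trimmed sample: the same chart reported for two nodes collapses to one option.
sample_response = {
    "api": 2,
    "alert_instances": [
        {"ni": 2, "nm": "disk_space_usage", "ch": "disk_space._", "ch_n": "disk_space._", "ctx": "disk.space"},
        {"ni": 3, "nm": "disk_space_usage", "ch": "disk_space._", "ch_n": "disk_space._", "ctx": "disk.space"},
    ],
}

options = instance_options(sample_response)
```

Deduplicating by chart id is a design choice for a single dropdown; if rules ever need to be scoped per node, the node index ("ni") would have to be kept as well.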

@hugovalente-pm
Contributor Author

this is released
