Server & API

The Alerta API receives alerts from multiple sources, :ref:`correlates <correlation>`, :ref:`de-duplicates <deduplication>` or :ref:`suppresses <blackout periods>` them, and makes the alerts available via a RESTful JSON API.

Alerts can be intercepted as they are received to modify, enhance or reject them using :ref:`pre-receive hooks <prereceive>`. Alerts can also be used to trigger actions in other systems after the alert has been processed using :ref:`post-receive hooks <postreceive>` or following an :ref:`operator action <take_action>` or alert :ref:`status change <status_change>` for bi-directional integration.

There are several :ref:`integrations <integrations>` with popular monitoring tools available and :ref:`webhooks <webhooks>` can be used to trivially integrate with AWS Prometheus, Grafana, PagerDuty and many more.

Event Processing

Alerta comes out-of-the-box with key features designed to reduce the burden of alert management. When an event is received it is processed in the following way:

  1. all plugin pre-receive hooks are run in listed order; an alert is immediately rejected if any plugin raises a RejectException or RateLimit exception
  2. the alert is checked against any active blackout periods and is suppressed if any match
  3. the alert is checked to see if it is a duplicate; if so, the duplicate count is increased and repeat is set to True
  4. the alert is checked to see if it is correlated; if so, the severity and/or status etc. are changed
  5. if the alert is neither a duplicate nor correlated, a new alert is created
  6. all plugin post-receive hooks are run in listed order
  7. any tags or attributes changed in post-receive hooks are persisted

Each of the above actions is explained in more detail in the following sections.
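The ordering of the steps above can be sketched as a single function. This is only an illustration and not Alerta's implementation: alerts are plain dicts, ``RejectException`` is a local stand-in, and ``RequireEnvironment``, ``process_alert`` and the callables passed to it (``is_blacked_out``, ``find_duplicate``, ``find_correlated``, ``save``) are hypothetical names chosen for the example.

```python
class RejectException(Exception):
    """Local stand-in for the exception a pre-receive hook raises to reject an alert."""


class RequireEnvironment:
    """Hypothetical plugin: rejects alerts without an environment, tags the rest."""

    def pre_receive(self, alert):
        if not alert.get("environment"):
            raise RejectException("alert must define an environment")
        return alert

    def post_receive(self, alert):
        alert.setdefault("tags", []).append("processed")


def process_alert(alert, plugins, is_blacked_out, find_duplicate, find_correlated, save):
    """Run one alert (a dict) through the steps in order and return an outcome label."""
    # 1. Pre-receive hooks run in listed order; a rejection stops processing.
    for plugin in plugins:
        try:
            alert = plugin.pre_receive(alert)
        except RejectException:
            return "rejected"

    # 2. Blackout check: the alert is accepted but not processed any further.
    if is_blacked_out(alert):
        return "suppressed"

    # 3-5. Duplicate first, then correlated, otherwise a new alert.
    existing = find_duplicate(alert)
    if existing is not None:
        existing["duplicateCount"] = existing.get("duplicateCount", 0) + 1
        existing["repeat"] = True
        outcome, current = "duplicate", existing
    else:
        existing = find_correlated(alert)
        if existing is not None:
            existing["severity"] = alert["severity"]
            outcome, current = "correlated", existing
        else:
            outcome, current = "created", alert

    # 6. Post-receive hooks run in listed order.
    for plugin in plugins:
        plugin.post_receive(current)

    # 7. Persist anything the post-receive hooks changed (tags, attributes).
    save(current)
    return outcome


saved = []
outcome = process_alert(
    {"environment": "Production", "resource": "host55", "event": "DiskFull", "severity": "major"},
    plugins=[RequireEnvironment()],
    is_blacked_out=lambda alert: False,
    find_duplicate=lambda alert: None,
    find_correlated=lambda alert: None,
    save=saved.append,
)
print(outcome)  # created
```

Passing the lookups in as callables keeps the sketch focused on the order of the steps; how duplicates and correlated alerts are actually found is covered in the sections that follow.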

Plugins

Plugins are small Python scripts that can run either before or after an alert is saved to the database, or before an operator action or status change update. This is achieved by registering pre-receive hooks for transformers, post-receive hooks for external notification and status change hooks for bi-directional integration.

Transformers

Using pre-receive hooks, plugins provide the ability to transform raw alert data from sources before alerts are created. For example, alerts can be normalised to ensure they all have specific attributes or tags or only have a specific value from a range of allowed values. This is demonstrated in the reject plugin that enforces an alert policy.

Plugins can also be used to enhance alerts -- like the Geo location plugin which adds location data to alerts based on the remote IP address of the client, or the generic enhance plugin which adds a customer attribute based on information contained in the alert.
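As a rough sketch of a transformer, the class below normalises the environment, enforces a small policy and adds a customer attribute in its ``pre_receive`` hook. A real plugin would build on Alerta's own plugin base class and its ``RejectException``; here ``Alert``, ``RejectException``, ``NormaliseAlert``, the allowed environments and the ``customer:host`` resource naming are all assumptions made so the example runs on its own.

```python
from dataclasses import dataclass, field

# Assumed policy for the example: only these environments are accepted.
ALLOWED_ENVIRONMENTS = {"Production", "Development"}


class RejectException(Exception):
    """Local stand-in for the exception used to reject an alert."""


@dataclass
class Alert:
    """Minimal stand-in for an alert; only the fields the example needs."""
    resource: str
    event: str
    environment: str = ""
    service: list = field(default_factory=list)
    attributes: dict = field(default_factory=dict)


class NormaliseAlert:
    """Hypothetical transformer plugin run before an alert is created."""

    def pre_receive(self, alert):
        # Normalise: " production " and "PRODUCTION" both become "Production".
        alert.environment = alert.environment.strip().capitalize()

        # Enforce policy: reject anything outside the allowed values.
        if alert.environment not in ALLOWED_ENVIRONMENTS:
            raise RejectException(f"environment must be one of {sorted(ALLOWED_ENVIRONMENTS)}")
        if not alert.service:
            raise RejectException("alert must define a service")

        # Enhance: derive a customer attribute from an assumed
        # "customer:host" resource naming convention.
        if ":" in alert.resource:
            alert.attributes.setdefault("customer", alert.resource.split(":", 1)[0])
        return alert


alert = NormaliseAlert().pre_receive(
    Alert(resource="acme:web01", event="DiskFull", environment=" production ", service=["Web"])
)
print(alert.environment, alert.attributes)  # Production {'customer': 'acme'}
```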

External Notification

Using post-receive hooks, plugin integrations can be used to provide downstream systems with alerts in realtime for external notification. For example, pushing alerts onto an AWS SNS topic, AMQP queue, logging to a Logstash/Kibana stack, or sending notifications to HipChat, Slack or Twilio and many more.
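A post-receive hook for notification might look something like the sketch below. It is not taken from any of the integrations named above: ``NotifyDownstream`` is a hypothetical class, alerts are plain dicts, and the ``send`` callable stands in for whatever client would publish to a queue, topic or chat service.

```python
import json


class NotifyDownstream:
    """Hypothetical notification plugin; `send` is any callable taking a JSON string."""

    def __init__(self, send):
        self.send = send

    def pre_receive(self, alert):
        # Nothing to transform; pass the alert through unchanged.
        return alert

    def post_receive(self, alert):
        # Design choice for this sketch: do not notify again for duplicates.
        if alert.get("repeat"):
            return
        text = (
            f"[{alert['severity']}] {alert['event']} on "
            f"{alert['resource']} ({alert['environment']})"
        )
        self.send(json.dumps({"text": text}))


sent = []  # collects messages instead of calling a real service
plugin = NotifyDownstream(send=sent.append)
plugin.post_receive(
    {"environment": "Production", "resource": "host55", "event": "DiskFull", "severity": "major"}
)
print(sent[0])  # {"text": "[major] DiskFull on host55 (Production)"}
```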

Operator Actions

Actions taken against alerts can be used as triggers for further integrations with external systems.

TBC

Bi-directional Integration

Using status change hooks, plugins can be used to complete a two-way integration with an external system. That is, an external system like Prometheus Alertmanager that generates alerts that are forwarded to Alerta can be updated when the status of an alert changes in Alerta.

For example, if an operator "acknowledges" a Prometheus alert in the Alerta web UI then a status change hook could silence the corresponding alert in Alertmanager. This requires that external systems provide enough information in the alert created in Alerta for that alert to be uniquely identified at a later date.
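The acknowledgement example could be sketched as a status change hook like the one below. Everything here is illustrative: ``SilenceOnAck`` is a hypothetical class, the ``externalId`` attribute is an assumed place to keep the identifier supplied by the external system, and ``silence`` stands in for a real call to that system.

```python
class SilenceOnAck:
    """Hypothetical status change plugin that silences the source alert on 'ack'."""

    def __init__(self, silence):
        # `silence` is any callable taking (external_id, comment).
        self.silence = silence

    def status_change(self, alert, status, text):
        # The external system must have supplied enough information to
        # identify its own alert later; "externalId" is an assumed name.
        external_id = alert.get("attributes", {}).get("externalId")
        if external_id is None:
            return
        if status == "ack":
            self.silence(external_id, text)


silenced = []
plugin = SilenceOnAck(silence=lambda external_id, comment: silenced.append((external_id, comment)))
plugin.status_change({"attributes": {"externalId": "am-1234"}}, "ack", "looking into it")
print(silenced)  # [('am-1234', 'looking into it')]
```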

More information about bi-directional integration and real-world examples for Telegram, Zabbix, Prometheus and many others can be found on the :ref:`Integrations & Plugins<bidirection integ>` page.

Blackout Periods

An alert that is received during a :index:`blackout period <single: blackouts>` is suppressed. That is, Alerta receives the alert and returns a 202 Accepted status code, but although the alert has been accepted it won't be processed.

Alerta defines many different alert attributes that can be used to group alerts, and it is these attributes that are used to define blackout rules. For example, a rule can suppress alerts from an entire environment, service or group, or a combination of these. It is also possible to define blackout rules based only on resource and event attributes for situations that require that level of granularity.

Tags can also be used to define a blackout rule, which allows a lot of flexibility because tags can be added at source, using the alerta CLI, or using a plugin. Note that a rule can require one or more tags to match an alert for the suppression to apply.

In summary, blackout rules can be any of:

  • an entire environment eg. environment=Production
  • a particular resource eg. resource=host55
  • an entire service eg. service=Web
  • every occurrence of a specific event eg. event=DiskFull
  • a group of events eg. group=Syslog
  • a specific event for a resource eg. resource=host55 and event=DiskFull
  • all events that have a specific set of tags eg. tags=[ blackout, london ]

Note that an environment is always required to be defined for a blackout rule.
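One way to picture how such rules are evaluated is the matching function below. It is a sketch only: rules and alerts are plain dicts, ``blackout_matches`` is a hypothetical name, an alert's service is assumed to be a list, and a rule with tags is assumed to match only when the alert carries all of them.

```python
def blackout_matches(rule, alert):
    """Return True if the alert falls under the blackout rule."""
    # An environment is always required on a rule.
    if rule["environment"] != alert.get("environment"):
        return False

    # Optional single-valued attributes must match exactly when present.
    for key in ("resource", "event", "group"):
        if key in rule and rule[key] != alert.get(key):
            return False

    # Assumption: an alert lists one or more services; the rule names one.
    if "service" in rule and rule["service"] not in alert.get("service", []):
        return False

    # Assumption: every tag on the rule must be present on the alert.
    if "tags" in rule and not set(rule["tags"]) <= set(alert.get("tags", [])):
        return False

    return True


alert = {
    "environment": "Production",
    "resource": "host55",
    "event": "DiskFull",
    "service": ["Web"],
    "tags": ["blackout", "london", "rack12"],
}
print(blackout_matches({"environment": "Production"}, alert))  # True
print(blackout_matches({"environment": "Production", "resource": "host55", "event": "DiskFull"}, alert))  # True
print(blackout_matches({"environment": "Production", "tags": ["blackout", "paris"]}, alert))  # False
```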

De-Duplication

When an alert with the same environment-resource-event combination is received with the same severity, the alert is de-duplicated.

This means that information from the de-duplicated alert is used to update key attributes of the existing alert (like duplicateCount, repeat flag, value, text and lastReceiveTime) and the new alert is not shown.
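The de-duplication rule and the resulting update can be sketched as follows. Alerts are plain dicts and ``is_duplicate`` and ``apply_duplicate`` are hypothetical helpers; only the attribute names come from the text above.

```python
def is_duplicate(existing, incoming):
    """Same environment-resource-event combination and the same severity."""
    keys = ("environment", "resource", "event", "severity")
    return all(existing[k] == incoming[k] for k in keys)


def apply_duplicate(existing, incoming):
    """Fold the incoming alert into the existing one instead of storing it."""
    existing["duplicateCount"] = existing.get("duplicateCount", 0) + 1
    existing["repeat"] = True
    for key in ("value", "text", "lastReceiveTime"):
        existing[key] = incoming[key]
    return existing


existing = {
    "environment": "Production", "resource": "host55", "event": "DiskFull",
    "severity": "major", "value": "91%", "text": "disk filling up",
    "lastReceiveTime": "2024-01-01T10:00:00Z", "duplicateCount": 0, "repeat": False,
}
incoming = dict(existing, value="93%", lastReceiveTime="2024-01-01T10:05:00Z")

if is_duplicate(existing, incoming):
    apply_duplicate(existing, incoming)
print(existing["duplicateCount"], existing["repeat"], existing["value"])  # 1 True 93%
```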

Alerts are sorted in the Alerta web UI by lastReceiveTime by default so that the most recent alerts will be displayed at the top regardless of whether they were new alerts or de-duplicated alerts.

Simple Correlation

Alerta implements what we call "simple correlation" -- as opposed to complex correlation which is much more involved. Simple correlation, in combination with de-duplication, provides straightforward and effective ways to reduce the burden of managing an alert console.

With Alerta, there are two ways alerts can be correlated, namely:

  1. When an alert with the same environment-resource-event combination is received with a different severity, then the alert is correlated.
  2. When an alert with the same environment-resource combination is received with an event in the correlate list of related events with any severity, then the alert is correlated.

In both cases, this means that information from the correlated alert is used to update key attributes of the existing alert (like severity, event, value, text and lastReceiveTime) and the new alert is not shown.
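Both correlation rules can be expressed in one small check, sketched below with plain dicts. ``is_correlated`` and ``apply_correlated`` are hypothetical helpers, and reading the correlate list from the existing alert is an assumption about where the list of related events lives.

```python
def is_correlated(existing, incoming):
    """Apply the two correlation rules to an existing and an incoming alert."""
    same_target = (
        existing["environment"] == incoming["environment"]
        and existing["resource"] == incoming["resource"]
    )
    if not same_target:
        return False
    if existing["event"] == incoming["event"]:
        # Rule 1: same event, different severity.
        return existing["severity"] != incoming["severity"]
    # Rule 2: a different event that appears in the correlate list, any severity.
    return incoming["event"] in existing.get("correlate", [])


def apply_correlated(existing, incoming):
    """Update the existing alert's key attributes from the incoming alert."""
    for key in ("severity", "event", "value", "text", "lastReceiveTime"):
        existing[key] = incoming[key]
    return existing


existing = {
    "environment": "Production", "resource": "host55", "event": "NodeDown",
    "severity": "major", "value": "down", "text": "node is down",
    "lastReceiveTime": "2024-01-01T10:00:00Z", "correlate": ["NodeDown", "NodeUp"],
}
incoming = {
    "environment": "Production", "resource": "host55", "event": "NodeUp",
    "severity": "normal", "value": "up", "text": "node is back",
    "lastReceiveTime": "2024-01-01T10:07:00Z",
}

if is_correlated(existing, incoming):
    apply_correlated(existing, incoming)
print(existing["event"], existing["severity"])  # NodeUp normal
```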

State-based Browser

Alerta is called state-based because it will automatically change the alert status based on the current and previous severity of alerts and subsequent user actions.

The Alerta API will:

  • only show the most recent state of any alert
  • change the status of an alert to closed if a normal, ok or cleared severity is received
  • change the status of a closed alert to open if the event reoccurs
  • change the status of an acknowledged alert to open if the new severity is higher than the current severity
  • update the severity and other key attributes of an alert when a more recent alert is received (see correlation and deduplication)
  • update the trendIndication attribute based on previousSeverity and current severity with either moreSevere, lessSevere or noChange
  • update the history log following a severity or status change (see alert history)

All of these automatic actions combine to ensure that important alerts are given the priority they deserve.
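The status and trend rules in the list above can be sketched as two small functions. The severity ranking used here is an assumption made for the example (a higher number means more severe) and is not Alerta's configured severity map; ``next_status`` and ``trend_indication`` are hypothetical names.

```python
# Assumed ranking for illustration only; a higher number means more severe.
SEVERITY_RANK = {
    "critical": 4, "major": 3, "minor": 2, "warning": 1,
    "normal": 0, "ok": 0, "cleared": 0,
}
CLEARING = {"normal", "ok", "cleared"}


def trend_indication(previous_severity, severity):
    """Compare the previous and current severity of an alert."""
    before, after = SEVERITY_RANK[previous_severity], SEVERITY_RANK[severity]
    if after > before:
        return "moreSevere"
    if after < before:
        return "lessSevere"
    return "noChange"


def next_status(status, previous_severity, severity):
    """Derive the new status from the current status and the severity change."""
    if severity in CLEARING:
        return "closed"
    if status == "closed":
        return "open"  # the event has reoccurred
    if status == "ack" and trend_indication(previous_severity, severity) == "moreSevere":
        return "open"  # things got worse after the acknowledgement
    return status


print(next_status("open", "major", "ok"))      # closed
print(next_status("closed", "ok", "major"))    # open
print(next_status("ack", "minor", "major"))    # open
print(next_status("ack", "major", "minor"))    # ack
print(trend_indication("minor", "major"))      # moreSevere
```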

Note

To take full advantage of the state-based browser it is recommended to implement the timeout of expired alerts using the :ref:`housekeeping` script.

Alert History

Whenever an alert status or severity changes, that change is recorded in the alert :ref:`history <history>` log. This allows operations staff to follow the lifecycle of a particular alert, if necessary.

The alert history is visible in the Alert Details page of any alert and also by using the alerta command-line tool history sub-command.

For example, it will show whether an alert status change happened as a result of operator (external) action or an automatic correlation (auto) action.

Heartbeats

An Alerta :ref:`heartbeat <Heartbeats>` is a periodic HTTP request sent to the Alerta API to indicate normal operation of the origin of the heartbeat.

They can be used to ensure components of the Alerta monitoring system are operating normally or sent from any other source. As well as an origin they include a timeout in seconds (after which they will be considered stale), and optional tags and attributes.
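To make the timeout concrete, the sketch below builds a heartbeat body and checks whether a previously received heartbeat has gone stale. ``heartbeat_payload`` and ``is_stale`` are hypothetical helpers, and the JSON field names are assumed to mirror the origin, timeout, tags and attributes described above; the API reference is the place to confirm the exact request format.

```python
from datetime import datetime, timedelta, timezone


def heartbeat_payload(origin, timeout, tags=None, attributes=None):
    """Build the body of a heartbeat request (field names are assumptions)."""
    return {
        "origin": origin,
        "timeout": timeout,  # seconds before the heartbeat is considered stale
        "tags": tags or [],
        "attributes": attributes or {},
    }


def is_stale(receive_time, timeout, now=None):
    """A heartbeat is stale once more than `timeout` seconds have passed."""
    now = now or datetime.now(timezone.utc)
    return now - receive_time > timedelta(seconds=timeout)


received = datetime(2024, 1, 1, 10, 0, 0, tzinfo=timezone.utc)
print(is_stale(received, 120, now=received + timedelta(seconds=90)))   # False
print(is_stale(received, 120, now=received + timedelta(seconds=180)))  # True
```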

They are visible in the Alerta console (Heartbeats page) and via the alerta command-line tool using the heartbeat sub-command to send them, and the heartbeats sub-command to view them.

Alerts can be generated from :index:`stale or slow heartbeats <pair: heartbeat; stale>` using alerta heartbeats --alert. For more information about generating alerts from heartbeats see the :ref:`heartbeats command<cli_heartbeats>` reference.