
AlertManager v2 (#2643)
Introduce AlertManager v2 integration with improved internal behaviour

It uses grouping from AlertManager instead of trying to re-group alerts on the OnCall side.
Existing AlertManager and Grafana Alerting integrations are marked as Legacy, with the option
to migrate them manually now or have them migrated automatically after the deprecation date (TBD).
Integration URLs and public API responses stay the same for both legacy and new integrations.

---------

Co-authored-by: Rares Mardare <rares.mardare@grafana.com>
Co-authored-by: Joey Orlando <joey.orlando@grafana.com>
3 people authored Aug 1, 2023
1 parent d90c4d9 commit 1ccb9d6
Showing 38 changed files with 1,762 additions and 748 deletions.
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -10,6 +10,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
### Added

- [Helm] Add `extraContainers` for engine, celery and migrate-job pods to define sidecars by @lu1as ([#2650](https://github.com/grafana/oncall/pull/2650))
- Rework of AlertManager integration ([#2643](https://github.com/grafana/oncall/pull/2643))

## v1.3.20 (2023-07-31)

64 changes: 59 additions & 5 deletions docs/sources/integrations/alertmanager/index.md
@@ -15,7 +15,13 @@ weight: 300

# Alertmanager integration for Grafana OnCall

> You must have the [role of Admin][user-and-team-management] to be able to create integrations in Grafana OnCall.
> ⚠️ A note about **(Legacy)** integrations:
> We are changing the internal behaviour of the AlertManager integration.
> Integrations created before version 1.3.21 are marked as **(Legacy)**.
> These integrations still receive and escalate alerts, but they will be migrated automatically after 1 November 2023.
> <br/><br/>
> To ensure a smooth transition, you can migrate legacy integrations yourself now.
> Read more about the changes and the migration process [here][migration].

The Alertmanager integration handles alerts from [Prometheus Alertmanager](https://prometheus.io/docs/alerting/latest/alertmanager/).
This integration is the recommended way to send alerts from Prometheus deployed in your infrastructure, to Grafana OnCall.
@@ -30,16 +36,14 @@ This integration is the recommended way to send alerts from Prometheus deployed
4. A new page will open with the integration details. Copy the **OnCall Integration URL** from the **HTTP Endpoint** section.
You will need it when configuring Alertmanager.

<!--![123](../_images/connect-new-monitoring.png)-->

## Configuring Alertmanager to Send Alerts to Grafana OnCall

1. Add a new [Webhook](https://prometheus.io/docs/alerting/latest/configuration/#webhook_config) receiver to `receivers`
section of your Alertmanager configuration
2. Set `url` to the **OnCall Integration URL** from previous section
- **Note:** The URL has a trailing slash that is required for it to work properly.
3. Set `send_resolved` to `true`, so Grafana OnCall can autoresolve alert groups when they are resolved in Alertmanager
4. It is recommended to set `max_alerts` to less than `300` to avoid rate-limiting issues
4. It is recommended to set `max_alerts` to less than `100` to avoid requests that are too large.
5. Use this receiver in your route configuration

Here is an example of the final configuration:
@@ -54,7 +58,7 @@ receivers:
webhook_configs:
- url: <integration-url>
send_resolved: true
max_alerts: 300
max_alerts: 100
```
## Complete the Integration Configuration
@@ -113,10 +117,60 @@ Add receiver configuration to `prometheus.yaml` with the **OnCall Heartbeat URL*
send_resolved: false
```

## Migrating from Legacy Integration

Previously, each alert from an AlertManager group was used as a separate payload:

```json
{
"labels": {
"severity": "critical",
"alertname": "InstanceDown"
},
"annotations": {
"title": "Instance localhost:8081 down",
"description": "Node has been down for more than 1 minute"
},
...
}
```

This behaviour led to mismatches in alert state between OnCall and AlertManager and drained rate limits,
since each AlertManager alert was counted separately.

We decided to change this behaviour to respect AlertManager grouping by using the AlertManager group as one payload:

```json
{
"alerts": [...],
"groupLabels": {"alertname": "InstanceDown"},
"commonLabels": {"job": "node", "alertname": "InstanceDown"},
"commonAnnotations": {"description": "Node has been down for more than 1 minute"},
"groupKey": "{}:{alertname=\"InstanceDown\"}",
...
}
```

You can read more about the AlertManager data model [here](https://prometheus.io/docs/alerting/latest/notifications/#data).

### How to migrate

> The integration URL will stay the same, so there is no need to change the AlertManager or Grafana Alerting configuration.
> Integration templates will be reset to suit the new payload.
> Routes must be adjusted manually to match the new payload.

1. Go to the **Integration Page**, click the three dots in the top right, and click **Migrate**.
2. A confirmation modal will be shown; read it carefully and proceed with the migration.
3. Send a demo alert to make sure everything went well.
4. Adjust routes to the new payload shape. You can use the payload of the demo alert from the previous step as an example, or the sketch below.
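
For illustration only, here is a minimal sketch of how a routing template might change, assuming Jinja2-style templates applied to the alert payload; the payloads and label values below are abbreviated, hypothetical examples rather than OnCall's actual template setup:

```python
# Sketch: reading a field from the legacy per-alert payload vs. the grouped payload.
from jinja2 import Template

legacy_payload = {"labels": {"severity": "critical", "alertname": "InstanceDown"}}
grouped_payload = {
    "commonLabels": {"job": "node", "alertname": "InstanceDown", "severity": "critical"},
    "alerts": [{"labels": {"severity": "critical", "alertname": "InstanceDown"}}],
}

# Legacy route condition: reads labels of a single alert.
legacy_route = Template('{{ payload.labels.severity == "critical" }}')
print(legacy_route.render(payload=legacy_payload))  # prints "True"

# New route condition: reads the group-level commonLabels instead.
new_route = Template('{{ payload.commonLabels.severity == "critical" }}')
print(new_route.render(payload=grouped_payload))  # prints "True"
```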

{{% docs/reference %}}
[user-and-team-management]: "/docs/oncall/ -> /docs/oncall/<ONCALL VERSION>/user-and-team-management"
[user-and-team-management]: "/docs/grafana-cloud/ -> /docs/grafana-cloud/alerting-and-irm/oncall/user-and-team-management"

[complete-the-integration-configuration]: "/docs/oncall/ -> /docs/oncall/<ONCALL VERSION>/integrations#complete-the-integration-configuration"
[complete-the-integration-configuration]: "/docs/grafana-cloud/ -> /docs/grafana-cloud/alerting-and-irm/oncall/integrations#complete-the-integration-configuration"

[migration]: "/docs/oncall/ -> /docs/oncall/<ONCALL VERSION>/integrations/alertmanager#migrating-from-legacy-integration"
[migration]: "/docs/grafana-cloud/ -> /docs/grafana-cloud/alerting-and-irm/oncall/integrations/alertmanager#migrating-from-legacy-integration"
{{% /docs/reference %}}
67 changes: 62 additions & 5 deletions docs/sources/integrations/grafana-alerting/index.md
@@ -14,6 +14,14 @@ weight: 100

# Grafana Alerting integration for Grafana OnCall

> ⚠️ A note about **(Legacy)** integrations:
> We are changing the internal behaviour of the Grafana Alerting integration.
> Integrations created before version 1.3.21 are marked as **(Legacy)**.
> These integrations still receive and escalate alerts, but they will be migrated automatically after 1 November 2023.
> <br/><br/>
> To ensure a smooth transition, you can migrate them yourself now.
> Read more about the changes and the migration process [here][migration].

Grafana Alerting for Grafana OnCall can be set up using two methods:

- Grafana Alerting: Grafana OnCall is connected to the same Grafana instance being used to manage Grafana OnCall.
@@ -53,11 +61,9 @@ Connect Grafana OnCall with alerts coming from a Grafana instance that is differ
OnCall is being managed:

1. In Grafana OnCall, navigate to the **Integrations** tab and select **New Integration to receive alerts**.
2. Select the **Grafana (Other Grafana)** tile.
3. Follow the configuration steps that display in the **How to connect** window to retrieve your unique integration URL
and complete any necessary configurations.
4. Determine the escalation chain for the new integration by either selecting an existing one or by creating a
new escalation chain.
2. Select the **Alertmanager** tile.
3. Enter a name and description for the integration, then click **Create**.
4. A new page will open with the integration details. Copy the **OnCall Integration URL** from the **HTTP Endpoint** section.
5. Go to the other Grafana instance to connect to Grafana OnCall and navigate to **Alerting > Contact Points**.
6. Select **New Contact Point**.
7. Choose the contact point type `webhook`, then paste the URL copied in step 4 into the URL field.
@@ -66,3 +72,54 @@ OnCall is being managed:
> see [Contact points in Grafana Alerting](https://grafana.com/docs/grafana/latest/alerting/unified-alerting/contact-points/).
8. Click the **Edit** (pencil) icon, then click **Test**. This will send a test alert to Grafana OnCall.

## Migrating from Legacy Integration

Previously, each alert from a Grafana Alerting group was used as a separate payload:

```json
{
"labels": {
"severity": "critical",
"alertname": "InstanceDown"
},
"annotations": {
"title": "Instance localhost:8081 down",
"description": "Node has been down for more than 1 minute"
},
...
}
```

This behaviour led to mismatches in alert state between OnCall and Grafana Alerting and drained rate limits,
since each Grafana Alerting alert was counted separately.

We decided to change this behaviour to respect Grafana Alerting grouping by using the AlertManager group as one payload:

```json
{
"alerts": [...],
"groupLabels": {"alertname": "InstanceDown"},
"commonLabels": {"job": "node", "alertname": "InstanceDown"},
"commonAnnotations": {"description": "Node has been down for more than 1 minute"},
"groupKey": "{}:{alertname=\"InstanceDown\"}",
...
}
```

You can read more about the AlertManager data model [here](https://prometheus.io/docs/alerting/latest/notifications/#data).

### How to migrate

> The integration URL will stay the same, so there is no need to make changes on the Grafana Alerting side.
> Integration templates will be reset to suit the new payload.
> Routes must be adjusted manually to match the new payload.

1. Go to the **Integration Page**, click the three dots in the top right, and click **Migrate**.
2. A confirmation modal will be shown; read it carefully and proceed with the migration.
3. Adjust routes to the new payload shape (see the sketch below).
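
As a rough, hypothetical illustration of working with the grouped payload (a sketch only, not OnCall's actual template engine setup), a title template can aggregate over the `alerts` list instead of reading a single alert:

```python
# Sketch: building a title from a grouped payload instead of a single alert.
from jinja2 import Template

grouped_payload = {
    "groupLabels": {"alertname": "InstanceDown"},
    "alerts": [
        {"annotations": {"title": "Instance localhost:8081 down"}},
        {"annotations": {"title": "Instance localhost:8082 down"}},
    ],
}

title_template = Template(
    "{{ payload.groupLabels.alertname }}: {{ payload.alerts | length }} alert(s) firing"
)
print(title_template.render(payload=grouped_payload))
# -> InstanceDown: 2 alert(s) firing
```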

{{% docs/reference %}}
[migration]: "/docs/oncall/ -> /docs/oncall/<ONCALL VERSION>/integrations/grafana-alerting#migrating-from-legacy-integration"
[migration]: "/docs/grafana-cloud/ -> /docs/grafana-cloud/alerting-and-irm/oncall/integrations/grafana-alerting#migrating-from-legacy-integration"
{{% /docs/reference %}}
3 changes: 2 additions & 1 deletion engine/apps/alerts/integration_options_mixin.py
@@ -25,6 +25,8 @@ def __init__(self, *args, **kwargs):
for integration_config in _config:
vars()[f"INTEGRATION_{integration_config.slug.upper()}"] = integration_config.slug

INTEGRATION_TYPES = {integration_config.slug for integration_config in _config}

INTEGRATION_CHOICES = tuple(
(
(
@@ -39,7 +41,6 @@ def __init__(self, *args, **kwargs):
WEB_INTEGRATION_CHOICES = [
integration_config.slug for integration_config in _config if integration_config.is_displayed_on_web
]
PUBLIC_API_INTEGRATION_MAP = {integration_config.slug: integration_config.slug for integration_config in _config}
INTEGRATION_SHORT_DESCRIPTION = {
integration_config.slug: integration_config.short_description for integration_config in _config
}
37 changes: 37 additions & 0 deletions engine/apps/alerts/migrations/0030_auto_20230731_0341.py
@@ -0,0 +1,37 @@
# Generated by Django 3.2.19 on 2023-07-31 03:41

from django.db import migrations


integration_alertmanager = "alertmanager"
integration_grafana_alerting = "grafana_alerting"

legacy_alertmanager = "legacy_alertmanager"
legacy_grafana_alerting = "legacy_grafana_alerting"


def make_integrations_legacy(apps, schema_editor):
    AlertReceiveChannel = apps.get_model("alerts", "AlertReceiveChannel")

    AlertReceiveChannel.objects.filter(integration=integration_alertmanager).update(integration=legacy_alertmanager)
    AlertReceiveChannel.objects.filter(integration=integration_grafana_alerting).update(integration=legacy_grafana_alerting)


def revert_make_integrations_legacy(apps, schema_editor):
    AlertReceiveChannel = apps.get_model("alerts", "AlertReceiveChannel")

    AlertReceiveChannel.objects.filter(integration=legacy_alertmanager).update(integration=integration_alertmanager)
    AlertReceiveChannel.objects.filter(integration=legacy_grafana_alerting).update(integration=integration_grafana_alerting)


class Migration(migrations.Migration):

    dependencies = [
        ('alerts', '0029_auto_20230728_0802'),
    ]

    operations = [
        migrations.RunPython(make_integrations_legacy, revert_make_integrations_legacy),
    ]
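
For context, here is a hypothetical Django shell snippet (run via `python manage.py shell`) for listing the integrations that this data migration re-labels; the model and field names follow the code in this diff, but your deployment may differ:

```python
# Hypothetical check: list the integrations marked as legacy by the migration above.
from apps.alerts.models import AlertReceiveChannel

legacy_channels = AlertReceiveChannel.objects.filter(
    integration__in=["legacy_alertmanager", "legacy_grafana_alerting"]
)
for channel in legacy_channels:
    # integration_url strips the legacy_ prefix, so the URL senders use is unchanged.
    print(channel.pk, channel.integration, channel.integration_url)
```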
35 changes: 14 additions & 21 deletions engine/apps/alerts/models/alert_receive_channel.py
@@ -18,9 +18,10 @@
from apps.alerts.grafana_alerting_sync_manager.grafana_alerting_sync import GrafanaAlertingSyncManager
from apps.alerts.integration_options_mixin import IntegrationOptionsMixin
from apps.alerts.models.maintainable_object import MaintainableObject
from apps.alerts.tasks import disable_maintenance, sync_grafana_alerting_contact_points
from apps.alerts.tasks import disable_maintenance
from apps.base.messaging import get_messaging_backend_from_id
from apps.base.utils import live_settings
from apps.integrations.legacy_prefix import remove_legacy_prefix
from apps.integrations.metadata import heartbeat
from apps.integrations.tasks import create_alert, create_alertmanager_alerts
from apps.metrics_exporter.helpers import (
@@ -339,7 +340,8 @@ def is_demo_alert_enabled(self):

@property
def description(self):
if self.integration == AlertReceiveChannel.INTEGRATION_GRAFANA_ALERTING:
# TODO: AMV2: Remove this check after legacy integrations are migrated.
if self.integration == AlertReceiveChannel.INTEGRATION_LEGACY_GRAFANA_ALERTING:
contact_points = self.contact_points.all()
rendered_description = jinja_template_env.from_string(self.config.description).render(
is_finished_alerting_setup=self.is_finished_alerting_setup,
@@ -421,7 +423,8 @@ def integration_url(self):
AlertReceiveChannel.INTEGRATION_MAINTENANCE,
]:
return None
return create_engine_url(f"integrations/v1/{self.config.slug}/{self.token}/")
slug = remove_legacy_prefix(self.config.slug)
return create_engine_url(f"integrations/v1/{slug}/{self.token}/")

@property
def inbound_email(self):
@@ -552,7 +555,12 @@ def send_demo_alert(self, payload=None):
if payload is None:
payload = self.config.example_payload

if self.has_alertmanager_payload_structure:
# TODO: AMV2: hack to keep demo alert working for integration with legacy alertmanager behaviour.
if self.integration in {
AlertReceiveChannel.INTEGRATION_LEGACY_GRAFANA_ALERTING,
AlertReceiveChannel.INTEGRATION_LEGACY_ALERTMANAGER,
AlertReceiveChannel.INTEGRATION_GRAFANA,
}:
alerts = payload.get("alerts", None)
if not isinstance(alerts, list) or not len(alerts):
raise UnableToSendDemoAlert(
@@ -573,12 +581,8 @@
)

@property
def has_alertmanager_payload_structure(self):
return self.integration in (
AlertReceiveChannel.INTEGRATION_ALERTMANAGER,
AlertReceiveChannel.INTEGRATION_GRAFANA,
AlertReceiveChannel.INTEGRATION_GRAFANA_ALERTING,
)
def based_on_alertmanager(self):
return getattr(self.config, "based_on_alertmanager", False)

# Insight logs
@property
@@ -652,14 +656,3 @@ def listen_for_alertreceivechannel_model_save(
metrics_remove_deleted_integration_from_cache(instance)
else:
metrics_update_integration_cache(instance)

if instance.integration == AlertReceiveChannel.INTEGRATION_GRAFANA_ALERTING:
if created:
instance.grafana_alerting_sync_manager.create_contact_points()
# do not trigger sync contact points if field "is_finished_alerting_setup" was updated
elif (
kwargs is None
or not kwargs.get("update_fields")
or "is_finished_alerting_setup" not in kwargs["update_fields"]
):
sync_grafana_alerting_contact_points.apply_async((instance.pk,), countdown=5)
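
This file imports `remove_legacy_prefix` from `apps.integrations.legacy_prefix` (and the serializer below imports `has_legacy_prefix`), but the module itself is not shown in this diff. A plausible sketch of those helpers, inferred from the `legacy_` slugs used in the migration above (the real implementation may differ):

```python
# Plausible sketch of apps/integrations/legacy_prefix.py (not shown in this diff).
LEGACY_PREFIX = "legacy_"


def has_legacy_prefix(slug: str) -> bool:
    """Return True if the integration slug refers to a legacy integration."""
    return slug.startswith(LEGACY_PREFIX)


def remove_legacy_prefix(slug: str) -> str:
    """Strip the legacy_ prefix so integration URLs keep the original slug."""
    return slug[len(LEGACY_PREFIX):] if has_legacy_prefix(slug) else slug
```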
4 changes: 2 additions & 2 deletions engine/apps/alerts/tests/test_alert_receiver_channel.py
@@ -117,9 +117,9 @@ def test_send_demo_alert(mocked_create_alert, make_organization, make_alert_rece
@pytest.mark.parametrize(
"integration",
[
AlertReceiveChannel.INTEGRATION_ALERTMANAGER,
AlertReceiveChannel.INTEGRATION_LEGACY_ALERTMANAGER,
AlertReceiveChannel.INTEGRATION_GRAFANA,
AlertReceiveChannel.INTEGRATION_GRAFANA_ALERTING,
AlertReceiveChannel.INTEGRATION_LEGACY_GRAFANA_ALERTING,
],
)
@pytest.mark.parametrize(
11 changes: 10 additions & 1 deletion engine/apps/api/serializers/alert_receive_channel.py
@@ -12,6 +12,7 @@
from apps.alerts.models import AlertReceiveChannel
from apps.alerts.models.channel_filter import ChannelFilter
from apps.base.messaging import get_messaging_backends
from apps.integrations.legacy_prefix import has_legacy_prefix
from common.api_helpers.custom_fields import TeamPrimaryKeyRelatedField
from common.api_helpers.exceptions import BadRequest
from common.api_helpers.mixins import APPEARANCE_TEMPLATE_NAMES, EagerLoadingMixin
@@ -52,6 +53,7 @@ class AlertReceiveChannelSerializer(EagerLoadingMixin, serializers.ModelSerializ
routes_count = serializers.SerializerMethodField()
connected_escalations_chains_count = serializers.SerializerMethodField()
inbound_email = serializers.CharField(required=False)
is_legacy = serializers.SerializerMethodField()

# integration heartbeat is in PREFETCH_RELATED not by mistake.
# With using of select_related ORM builds strange join
@@ -90,6 +92,7 @@ class Meta:
"connected_escalations_chains_count",
"is_based_on_alertmanager",
"inbound_email",
"is_legacy",
]
read_only_fields = [
"created_at",
Expand All @@ -105,12 +108,15 @@ class Meta:
"connected_escalations_chains_count",
"is_based_on_alertmanager",
"inbound_email",
"is_legacy",
]
extra_kwargs = {"integration": {"required": True}}

def create(self, validated_data):
organization = self.context["request"].auth.organization
integration = validated_data.get("integration")
# if has_legacy_prefix(integration):
# raise BadRequest(detail="This integration is deprecated")
if integration == AlertReceiveChannel.INTEGRATION_GRAFANA_ALERTING:
connection_error = GrafanaAlertingSyncManager.check_for_connection_errors(organization)
if connection_error:
@@ -185,6 +191,9 @@ def get_alert_groups_count(self, obj):
def get_routes_count(self, obj) -> int:
return obj.channel_filters.count()

def get_is_legacy(self, obj) -> bool:
return has_legacy_prefix(obj.integration)

def get_connected_escalations_chains_count(self, obj) -> int:
return (
ChannelFilter.objects.filter(alert_receive_channel=obj, escalation_chain__isnull=False)
@@ -262,7 +271,7 @@ def get_payload_example(self, obj):
return None

def get_is_based_on_alertmanager(self, obj):
return obj.has_alertmanager_payload_structure
return obj.based_on_alertmanager

# Override method to pass field_name directly in set_value to handle None values for WritableSerializerField
def to_internal_value(self, data):
