AlertManager v2 #2643

Merged
merged 50 commits into from
Aug 1, 2023
Commits
4d5fa44
First pass on AlertManager v2
Konstantinov-Innokentii Jul 24, 2023
1a24597
Introduce legacy_prefix package
Konstantinov-Innokentii Jul 24, 2023
a3d41fe
Merge branch 'dev' into grafana_alerting_v2
Konstantinov-Innokentii Jul 25, 2023
bd731f9
Use set literal for INTEGRATION_TYPES
Konstantinov-Innokentii Jul 25, 2023
7d9c9e7
Add migrate endpoint for integration
Konstantinov-Innokentii Jul 25, 2023
1422bd1
Temporary allow creating legacy integrations
Konstantinov-Innokentii Jul 25, 2023
1f685ef
Merge branch 'dev' into amv2
Konstantinov-Innokentii Jul 26, 2023
22f79e6
Update CHANGELOG.md
Konstantinov-Innokentii Jul 26, 2023
802dd63
Merge remote-tracking branch 'origin/amv2' into amv2
Konstantinov-Innokentii Jul 26, 2023
cb18375
Decouple GrafanaAPIView from AlertManagerAPIView
Konstantinov-Innokentii Jul 26, 2023
342cfaf
based_on_alertmanager naming
Konstantinov-Innokentii Jul 26, 2023
06b865a
Hack to keep demo alert working for integration with legacy AM behaviour
Konstantinov-Innokentii Jul 26, 2023
fe903b7
Add migration
Konstantinov-Innokentii Jul 26, 2023
2c4084c
frontend changes
teodosii Jul 27, 2023
629f46c
fixed updateItems by passing the page param
teodosii Jul 27, 2023
13b6a83
isLegacy check
teodosii Jul 27, 2023
4d29d48
Draft docs
Konstantinov-Innokentii Jul 27, 2023
0234c71
Docs iteration
Konstantinov-Innokentii Jul 28, 2023
58d7f59
Docs polishing
Konstantinov-Innokentii Jul 28, 2023
2c7940a
Docs polishing
Konstantinov-Innokentii Jul 28, 2023
f200d95
Add annotations to payload example
Konstantinov-Innokentii Jul 28, 2023
f5cdbf5
Merge branch 'dev' into amv2
Konstantinov-Innokentii Jul 28, 2023
cbe4328
Text polishing
Konstantinov-Innokentii Jul 28, 2023
052ae46
Fix typos
Konstantinov-Innokentii Jul 28, 2023
046636a
Merge remote-tracking branch 'origin/dev' into amv2
Konstantinov-Innokentii Jul 28, 2023
a67f92f
Temporary remove migration
Konstantinov-Innokentii Jul 28, 2023
78ed3ef
Fix Changelog
Konstantinov-Innokentii Jul 28, 2023
05e4ee8
frontend changes
teodosii Jul 28, 2023
d6d91e6
linter
teodosii Jul 28, 2023
fef74ea
ui display changes
teodosii Jul 28, 2023
6e7253d
Polishing
Konstantinov-Innokentii Jul 31, 2023
166cf97
Merge branch 'dev' into amv2
Konstantinov-Innokentii Jul 31, 2023
dc468c1
Fix tests
Konstantinov-Innokentii Jul 31, 2023
54cbfa6
Comments polishing
Konstantinov-Innokentii Jul 31, 2023
c17dd28
Add migration
Konstantinov-Innokentii Jul 31, 2023
d9b70d8
Docs polishing
Konstantinov-Innokentii Jul 31, 2023
38e20c3
Comment polishing
Konstantinov-Innokentii Jul 31, 2023
16da5c1
Refresh templates on migration
Konstantinov-Innokentii Jul 31, 2023
283282d
Merge branch 'dev' into amv2
joeyorlando Jul 31, 2023
f4630bc
Update docs/sources/integrations/alertmanager/index.md
Konstantinov-Innokentii Aug 1, 2023
029abef
Merge branch 'dev' into amv2
Konstantinov-Innokentii Aug 1, 2023
58ffe1b
Rename migrateChannelFiter to migrateChannel
Konstantinov-Innokentii Aug 1, 2023
a228f83
Remove excess function
Konstantinov-Innokentii Aug 1, 2023
c08937c
Skip test_related_shifts
Konstantinov-Innokentii Aug 1, 2023
5fe2dbb
Fix migration
Konstantinov-Innokentii Aug 1, 2023
a5c05ad
Polishing
Konstantinov-Innokentii Aug 1, 2023
b6bc830
Add migration Date on frontend
Konstantinov-Innokentii Aug 1, 2023
237cce1
Fix tests
Konstantinov-Innokentii Aug 1, 2023
4a01a81
Skip test_related_shifts
Konstantinov-Innokentii Aug 1, 2023
595e7ac
Update CHANGELOG.md
Konstantinov-Innokentii Aug 1, 2023
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -10,6 +10,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
### Added

- [Helm] Add `extraContainers` for engine, celery and migrate-job pods to define sidecars by @lu1as ([#2650](https://github.com/grafana/oncall/pull/2650))
- Rework of AlertManager integration ([#2643](https://github.com/grafana/oncall/pull/2643))

## v1.3.20 (2023-07-31)

64 changes: 59 additions & 5 deletions docs/sources/integrations/alertmanager/index.md
@@ -15,7 +15,13 @@ weight: 300

# Alertmanager integration for Grafana OnCall

> You must have the [role of Admin][user-and-team-management] to be able to create integrations in Grafana OnCall.
> ⚠️ A note about **(Legacy)** integrations:
> We are changing the internal behaviour of the AlertManager integration.
> Integrations created before version 1.3.21 are marked as **(Legacy)**.
> These integrations still receive and escalate alerts, but they will be automatically migrated after 1 November 2023.
> <br/><br/>
> To ensure a smooth transition, you can migrate legacy integrations yourself now.
> See [Migrating from Legacy Integration][migration] for details about the changes and the migration process.

The Alertmanager integration handles alerts from [Prometheus Alertmanager](https://prometheus.io/docs/alerting/latest/alertmanager/).
This integration is the recommended way to send alerts from Prometheus deployed in your infrastructure, to Grafana OnCall.
@@ -30,16 +36,14 @@ This integration is the recommended way to send alerts from Prometheus deployed
4. A new page will open with the integration details. Copy the **OnCall Integration URL** from **HTTP Endpoint** section.
You will need it when configuring Alertmanager.

<!--![123](../_images/connect-new-monitoring.png)-->

## Configuring Alertmanager to Send Alerts to Grafana OnCall

1. Add a new [Webhook](https://prometheus.io/docs/alerting/latest/configuration/#webhook_config) receiver to the `receivers`
section of your Alertmanager configuration.
2. Set `url` to the **OnCall Integration URL** from the previous section.
- **Note:** The URL has a trailing slash that is required for it to work properly.
3. Set `send_resolved` to `true`, so Grafana OnCall can auto-resolve alert groups when they are resolved in Alertmanager.
4. It is recommended to set `max_alerts` to less than `300` to avoid rate-limiting issues
4. It is recommended to set `max_alerts` to less than `100` to avoid requests that are too large.
5. Use this receiver in your route configuration

Here is an example of the final configuration:
@@ -54,7 +58,7 @@ receivers:
webhook_configs:
- url: <integration-url>
send_resolved: true
max_alerts: 300
max_alerts: 100
```

## Complete the Integration Configuration
@@ -113,10 +117,60 @@ Add receiver configuration to `prometheus.yaml` with the **OnCall Heartbeat URL*
send_resolved: false
```

## Migrating from Legacy Integration

Previously, each alert in an AlertManager group was treated as a separate payload:

```json
{
"labels": {
"severity": "critical",
"alertname": "InstanceDown"
},
"annotations": {
"title": "Instance localhost:8081 down",
"description": "Node has been down for more than 1 minute"
},
...
}
```

This behaviour led to mismatches in alert state between OnCall and AlertManager and drained rate limits,
since each AlertManager alert was counted separately.

We decided to change this behaviour to respect AlertManager grouping by treating each AlertManager group as a single payload:

```json
{
"alerts": [...],
"groupLabels": {"alertname": "InstanceDown"},
"commonLabels": {"job": "node", "alertname": "InstanceDown"},
"commonAnnotations": {"description": "Node has been down for more than 1 minute"},
"groupKey": "{}:{alertname=\"InstanceDown\"}",
...
}
```

You can read more about AlertManager Data model [here](https://prometheus.io/docs/alerting/latest/notifications/#data).
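
To make the difference concrete, the following Python sketch contrasts the two behaviours on a small hand-written group. It is an illustration only: `legacy_payloads` and `grouped_payloads` are hypothetical names, not OnCall functions, and the `instance` labels are made-up sample data.

```python
from typing import Any, Dict, List

# Hypothetical AlertManager webhook body, reduced to a few of the fields shown above.
group: Dict[str, Any] = {
    "alerts": [
        {"labels": {"severity": "critical", "alertname": "InstanceDown", "instance": "localhost:8081"}},
        {"labels": {"severity": "critical", "alertname": "InstanceDown", "instance": "localhost:8082"}},
    ],
    "groupLabels": {"alertname": "InstanceDown"},
    "commonLabels": {"severity": "critical", "alertname": "InstanceDown"},
}


def legacy_payloads(group: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Legacy behaviour (sketch): every alert in the group becomes its own payload."""
    return list(group.get("alerts", []))


def grouped_payloads(group: Dict[str, Any]) -> List[Dict[str, Any]]:
    """New behaviour (sketch): the whole group is passed on as a single payload."""
    return [group]
```

With this sample, the legacy path yields two payloads, each counted separately, while the new path yields one payload that still carries both alerts under `alerts`.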

### How to migrate

> The integration URL stays the same, so there is no need to change the AlertManager or Grafana Alerting configuration.
> Integration templates will be reset to suit the new payload.
> Routes must be adjusted manually to match the new payload.

1. Go to the **Integration Page**, click the three dots in the top right corner, then click **Migrate**.
2. A confirmation modal will appear. Read it carefully and proceed with the migration.
3. Send a demo alert to make sure everything went well.
4. Adjust routes to the new shape of the payload. You can use the payload of the demo alert from the previous step as an example.
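
As a rough illustration of the route adjustment in the last step, the sketch below expresses a route condition in plain Python rather than in OnCall's own route syntax. All three function names and the `severity` check are hypothetical; only the field names `labels`, `commonLabels` and `alerts` come from the payload examples above.

```python
from typing import Any, Dict


def legacy_route_matches(payload: Dict[str, Any]) -> bool:
    """Legacy shape: the payload is a single alert, so labels sit at the top level."""
    return payload.get("labels", {}).get("severity") == "critical"


def migrated_route_matches(payload: Dict[str, Any]) -> bool:
    """New shape: the payload is a whole group, so check the labels shared by all alerts."""
    return payload.get("commonLabels", {}).get("severity") == "critical"


def any_alert_matches(payload: Dict[str, Any]) -> bool:
    """Fallback for labels that differ between alerts in the same group."""
    return any(
        alert.get("labels", {}).get("severity") == "critical"
        for alert in payload.get("alerts", [])
    )
```

`commonLabels` holds only the labels shared by every alert in the group, so a condition on a label that varies between alerts has to look inside `alerts` instead, as `any_alert_matches` does.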

{{% docs/reference %}}
[user-and-team-management]: "/docs/oncall/ -> /docs/oncall/<ONCALL VERSION>/user-and-team-management"
[user-and-team-management]: "/docs/grafana-cloud/ -> /docs/grafana-cloud/alerting-and-irm/oncall/user-and-team-management"

[complete-the-integration-configuration]: "/docs/oncall/ -> /docs/oncall/<ONCALL VERSION>/integrations#complete-the-integration-configuration"
[complete-the-integration-configuration]: "/docs/grafana-cloud/ -> /docs/grafana-cloud/alerting-and-irm/oncall/integrations#complete-the-integration-configuration"

[migration]: "/docs/oncall/ -> /docs/oncall/<ONCALL VERSION>/integrations/alertmanager#migrating-from-legacy-integration"
[migration]: "/docs/grafana-cloud/ -> /docs/grafana-cloud/alerting-and-irm/oncall/integrations/alertmanager#migrating-from-legacy-integration"
{{% /docs/reference %}}
67 changes: 62 additions & 5 deletions docs/sources/integrations/grafana-alerting/index.md
@@ -14,6 +14,14 @@ weight: 100

# Grafana Alerting integration for Grafana OnCall

> ⚠️ A note about **(Legacy)** integrations:
> We are changing the internal behaviour of the Grafana Alerting integration.
> Integrations created before version 1.3.21 are marked as **(Legacy)**.
> These integrations still receive and escalate alerts, but they will be automatically migrated after 1 November 2023.
> <br/><br/>
> To ensure a smooth transition, you can migrate them yourself now.
> See [Migrating from Legacy Integration][migration] for details about the changes and the migration process.

Grafana Alerting for Grafana OnCall can be set up using two methods:

- Grafana Alerting: Grafana OnCall is connected to the same Grafana instance being used to manage Grafana OnCall.
@@ -53,11 +61,9 @@ Connect Grafana OnCall with alerts coming from a Grafana instance that is differ
OnCall is being managed:

1. In Grafana OnCall, navigate to the **Integrations** tab and select **New Integration to receive alerts**.
2. Select the **Grafana (Other Grafana)** tile.
3. Follow the configuration steps that display in the **How to connect** window to retrieve your unique integration URL
and complete any necessary configurations.
4. Determine the escalation chain for the new integration by either selecting an existing one or by creating a
new escalation chain.
2. Select the **Alertmanager** tile.
3. Enter a name and description for the integration, then click **Create**.
4. A new page will open with the integration details. Copy the **OnCall Integration URL** from the **HTTP Endpoint** section.
5. Go to the other Grafana instance to connect to Grafana OnCall and navigate to **Alerting > Contact Points**.
6. Select **New Contact Point**.
7. Choose the contact point type `webhook`, then paste the URL copied in step 4 into the URL field.
@@ -66,3 +72,54 @@ OnCall is being managed:
> see [Contact points in Grafana Alerting](https://grafana.com/docs/grafana/latest/alerting/unified-alerting/contact-points/).

8. Click the **Edit** (pencil) icon, then click **Test**. This will send a test alert to Grafana OnCall.

## Migrating from Legacy Integration

Previously, each alert in a Grafana Alerting group was treated as a separate payload:

```json
{
"labels": {
"severity": "critical",
"alertname": "InstanceDown"
},
"annotations": {
"title": "Instance localhost:8081 down",
"description": "Node has been down for more than 1 minute"
},
...
}
```

This behaviour led to mismatches in alert state between OnCall and Grafana Alerting and drained rate limits,
since each Grafana Alerting alert was counted separately.

We decided to change this behaviour to respect Grafana Alerting grouping by treating each AlertManager group as a single payload:

```json
{
"alerts": [...],
"groupLabels": {"alertname": "InstanceDown"},
"commonLabels": {"job": "node", "alertname": "InstanceDown"},
"commonAnnotations": {"description": "Node has been down for more than 1 minute"},
"groupKey": "{}:{alertname=\"InstanceDown\"}",
...
}
```

You can read more about AlertManager Data model [here](https://prometheus.io/docs/alerting/latest/notifications/#data).

### How to migrate

> The integration URL stays the same, so there is no need to change anything on the Grafana Alerting side.
> Integration templates will be reset to suit the new payload.
> Routes must be adjusted manually to match the new payload.

1. Go to the **Integration Page**, click the three dots in the top right corner, then click **Migrate**.
2. A confirmation modal will appear. Read it carefully and proceed with the migration.
3. Adjust routes to the new shape of the payload.

{{% docs/reference %}}
[migration]: "/docs/oncall/ -> /docs/oncall/<ONCALL VERSION>/integrations/grafana-alerting#migrating-from-legacy-integration"
[migration]: "/docs/grafana-cloud/ -> /docs/grafana-cloud/alerting-and-irm/oncall/integrations/grafana-alerting#migrating-from-legacy-integration"
{{% /docs/reference %}}
3 changes: 2 additions & 1 deletion engine/apps/alerts/integration_options_mixin.py
@@ -25,6 +25,8 @@ def __init__(self, *args, **kwargs):
for integration_config in _config:
vars()[f"INTEGRATION_{integration_config.slug.upper()}"] = integration_config.slug

INTEGRATION_TYPES = {integration_config.slug for integration_config in _config}

INTEGRATION_CHOICES = tuple(
(
(
@@ -39,7 +41,6 @@ def __init__(self, *args, **kwargs):
WEB_INTEGRATION_CHOICES = [
integration_config.slug for integration_config in _config if integration_config.is_displayed_on_web
]
PUBLIC_API_INTEGRATION_MAP = {integration_config.slug: integration_config.slug for integration_config in _config}
INTEGRATION_SHORT_DESCRIPTION = {
integration_config.slug: integration_config.short_description for integration_config in _config
}
37 changes: 37 additions & 0 deletions engine/apps/alerts/migrations/0030_auto_20230731_0341.py
@@ -0,0 +1,37 @@
# Generated by Django 3.2.19 on 2023-07-31 03:41

from django.db import migrations


integration_alertmanager = "alertmanager"
integration_grafana_alerting = "grafana_alerting"

legacy_alertmanager = "legacy_alertmanager"
legacy_grafana_alerting = "legacy_grafana_alerting"


def make_integrations_legacy(apps, schema_editor):
    AlertReceiveChannel = apps.get_model("alerts", "AlertReceiveChannel")

    AlertReceiveChannel.objects.filter(integration=integration_alertmanager).update(integration=legacy_alertmanager)
    AlertReceiveChannel.objects.filter(integration=integration_grafana_alerting).update(integration=legacy_grafana_alerting)


def revert_make_integrations_legacy(apps, schema_editor):
    AlertReceiveChannel = apps.get_model("alerts", "AlertReceiveChannel")

    AlertReceiveChannel.objects.filter(integration=legacy_alertmanager).update(integration=integration_alertmanager)
    AlertReceiveChannel.objects.filter(integration=legacy_grafana_alerting).update(integration=integration_grafana_alerting)


class Migration(migrations.Migration):

    dependencies = [
        ('alerts', '0029_auto_20230728_0802'),
    ]

    operations = [
        migrations.RunPython(make_integrations_legacy, revert_make_integrations_legacy),
    ]
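
The forward and reverse functions in this migration are mirror images of each other. The sketch below reproduces the same renaming on a plain list of slugs, without Django, to show that applying one and then the other restores the starting values. `rename_slugs` is a hypothetical stand-in for the `filter(...).update(...)` calls, and the `webhook` slug is sample data used only to show that other integrations are left untouched.

```python
from typing import Dict, List

# Slug pairs taken from the migration above.
TO_LEGACY: Dict[str, str] = {
    "alertmanager": "legacy_alertmanager",
    "grafana_alerting": "legacy_grafana_alerting",
}
# Reverse mapping, mirroring revert_make_integrations_legacy.
FROM_LEGACY: Dict[str, str] = {legacy: current for current, legacy in TO_LEGACY.items()}


def rename_slugs(slugs: List[str], mapping: Dict[str, str]) -> List[str]:
    """Rename every slug found in the mapping and leave all others unchanged."""
    return [mapping.get(slug, slug) for slug in slugs]


before = ["alertmanager", "grafana_alerting", "webhook"]
after = rename_slugs(before, TO_LEGACY)
restored = rename_slugs(after, FROM_LEGACY)
```

The round trip is only lossless if no `legacy_*` rows existed before the forward step; any integration created as legacy afterwards would also be renamed on revert.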
35 changes: 14 additions & 21 deletions engine/apps/alerts/models/alert_receive_channel.py
@@ -18,9 +18,10 @@
from apps.alerts.grafana_alerting_sync_manager.grafana_alerting_sync import GrafanaAlertingSyncManager
from apps.alerts.integration_options_mixin import IntegrationOptionsMixin
from apps.alerts.models.maintainable_object import MaintainableObject
from apps.alerts.tasks import disable_maintenance, sync_grafana_alerting_contact_points
from apps.alerts.tasks import disable_maintenance
from apps.base.messaging import get_messaging_backend_from_id
from apps.base.utils import live_settings
from apps.integrations.legacy_prefix import remove_legacy_prefix
from apps.integrations.metadata import heartbeat
from apps.integrations.tasks import create_alert, create_alertmanager_alerts
from apps.metrics_exporter.helpers import (
@@ -339,7 +340,8 @@ def is_demo_alert_enabled(self):

@property
def description(self):
if self.integration == AlertReceiveChannel.INTEGRATION_GRAFANA_ALERTING:
# TODO: AMV2: Remove this check after legacy integrations are migrated.
if self.integration == AlertReceiveChannel.INTEGRATION_LEGACY_GRAFANA_ALERTING:
contact_points = self.contact_points.all()
rendered_description = jinja_template_env.from_string(self.config.description).render(
is_finished_alerting_setup=self.is_finished_alerting_setup,
@@ -421,7 +423,8 @@ def integration_url(self):
AlertReceiveChannel.INTEGRATION_MAINTENANCE,
]:
return None
return create_engine_url(f"integrations/v1/{self.config.slug}/{self.token}/")
slug = remove_legacy_prefix(self.config.slug)
return create_engine_url(f"integrations/v1/{slug}/{self.token}/")

@property
def inbound_email(self):
@@ -552,7 +555,12 @@ def send_demo_alert(self, payload=None):
if payload is None:
payload = self.config.example_payload

if self.has_alertmanager_payload_structure:
# TODO: AMV2: hack to keep demo alert working for integration with legacy alertmanager behaviour.
Review comment from @iskhakov (Contributor), Aug 1, 2023:

Could you provide clearer instructions on what should be done here, and when?

Reply from the PR author (Member):

Actually, I can't. We still have the GRAFANA integration, which might or might not have legacy alertmanager behaviour, so there is no clear instruction on what to do. I'll remove the TODO.

if self.integration in {
AlertReceiveChannel.INTEGRATION_LEGACY_GRAFANA_ALERTING,
AlertReceiveChannel.INTEGRATION_LEGACY_ALERTMANAGER,
AlertReceiveChannel.INTEGRATION_GRAFANA,
}:
alerts = payload.get("alerts", None)
if not isinstance(alerts, list) or not len(alerts):
raise UnableToSendDemoAlert(
@@ -573,12 +581,8 @@ def send_demo_alert(self, payload=None):
)

@property
def has_alertmanager_payload_structure(self):
return self.integration in (
AlertReceiveChannel.INTEGRATION_ALERTMANAGER,
AlertReceiveChannel.INTEGRATION_GRAFANA,
AlertReceiveChannel.INTEGRATION_GRAFANA_ALERTING,
)
def based_on_alertmanager(self):
return getattr(self.config, "based_on_alertmanager", False)

# Insight logs
@property
@@ -652,14 +656,3 @@ def listen_for_alertreceivechannel_model_save(
metrics_remove_deleted_integration_from_cache(instance)
else:
metrics_update_integration_cache(instance)

if instance.integration == AlertReceiveChannel.INTEGRATION_GRAFANA_ALERTING:
if created:
instance.grafana_alerting_sync_manager.create_contact_points()
# do not trigger sync contact points if field "is_finished_alerting_setup" was updated
elif (
kwargs is None
or not kwargs.get("update_fields")
or "is_finished_alerting_setup" not in kwargs["update_fields"]
):
sync_grafana_alerting_contact_points.apply_async((instance.pk,), countdown=5)
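
The `integration_url` change in this file relies on `remove_legacy_prefix` from `apps.integrations.legacy_prefix`, whose body is not part of the diff shown here. The sketch below is a guess at how such helpers could work, assuming the prefix is the literal string `legacy_` (as the slug names in the migration suggest). `LEGACY_PREFIX` and `integration_path` are hypothetical names, and the real implementation may differ.

```python
LEGACY_PREFIX = "legacy_"  # assumption, inferred from slugs such as "legacy_alertmanager"


def has_legacy_prefix(slug: str) -> bool:
    """Return True if the integration slug marks a legacy integration."""
    return slug.startswith(LEGACY_PREFIX)


def remove_legacy_prefix(slug: str) -> str:
    """Strip the legacy prefix so legacy and migrated integrations share one URL slug."""
    return slug[len(LEGACY_PREFIX):] if has_legacy_prefix(slug) else slug


def integration_path(slug: str, token: str) -> str:
    """Hypothetical helper mirroring the path built in integration_url."""
    return f"integrations/v1/{remove_legacy_prefix(slug)}/{token}/"
```

Under this assumption, `integration_path("legacy_alertmanager", token)` and `integration_path("alertmanager", token)` produce the same path, which is consistent with the documentation's statement that the integration URL does not change on migration.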
4 changes: 2 additions & 2 deletions engine/apps/alerts/tests/test_alert_receiver_channel.py
@@ -117,9 +117,9 @@ def test_send_demo_alert(mocked_create_alert, make_organization, make_alert_rece
@pytest.mark.parametrize(
"integration",
[
AlertReceiveChannel.INTEGRATION_ALERTMANAGER,
AlertReceiveChannel.INTEGRATION_LEGACY_ALERTMANAGER,
AlertReceiveChannel.INTEGRATION_GRAFANA,
AlertReceiveChannel.INTEGRATION_GRAFANA_ALERTING,
AlertReceiveChannel.INTEGRATION_LEGACY_GRAFANA_ALERTING,
],
)
@pytest.mark.parametrize(
11 changes: 10 additions & 1 deletion engine/apps/api/serializers/alert_receive_channel.py
@@ -12,6 +12,7 @@
from apps.alerts.models import AlertReceiveChannel
from apps.alerts.models.channel_filter import ChannelFilter
from apps.base.messaging import get_messaging_backends
from apps.integrations.legacy_prefix import has_legacy_prefix
from common.api_helpers.custom_fields import TeamPrimaryKeyRelatedField
from common.api_helpers.exceptions import BadRequest
from common.api_helpers.mixins import APPEARANCE_TEMPLATE_NAMES, EagerLoadingMixin
@@ -52,6 +53,7 @@ class AlertReceiveChannelSerializer(EagerLoadingMixin, serializers.ModelSerializ
routes_count = serializers.SerializerMethodField()
connected_escalations_chains_count = serializers.SerializerMethodField()
inbound_email = serializers.CharField(required=False)
is_legacy = serializers.SerializerMethodField()

# integration heartbeat is in PREFETCH_RELATED not by mistake.
# With using of select_related ORM builds strange join
@@ -90,6 +92,7 @@ class Meta:
"connected_escalations_chains_count",
"is_based_on_alertmanager",
"inbound_email",
"is_legacy",
]
read_only_fields = [
"created_at",
@@ -105,12 +108,15 @@ class Meta:
"connected_escalations_chains_count",
"is_based_on_alertmanager",
"inbound_email",
"is_legacy",
]
extra_kwargs = {"integration": {"required": True}}

def create(self, validated_data):
organization = self.context["request"].auth.organization
integration = validated_data.get("integration")
# if has_legacy_prefix(integration):
# raise BadRequest(detail="This integration is deprecated")
Review comment on lines +118 to +119 (Contributor):

Should this be uncommented or removed?

Follow-up (Contributor):

Bumping this.

if integration == AlertReceiveChannel.INTEGRATION_GRAFANA_ALERTING:
connection_error = GrafanaAlertingSyncManager.check_for_connection_errors(organization)
if connection_error:
@@ -185,6 +191,9 @@ def get_alert_groups_count(self, obj):
def get_routes_count(self, obj) -> int:
return obj.channel_filters.count()

def get_is_legacy(self, obj) -> bool:
return has_legacy_prefix(obj.integration)

def get_connected_escalations_chains_count(self, obj) -> int:
return (
ChannelFilter.objects.filter(alert_receive_channel=obj, escalation_chain__isnull=False)
@@ -262,7 +271,7 @@ def get_payload_example(self, obj):
return None

def get_is_based_on_alertmanager(self, obj):
return obj.has_alertmanager_payload_structure
return obj.based_on_alertmanager

# Override method to pass field_name directly in set_value to handle None values for WritableSerializerField
def to_internal_value(self, data):