Rework Azure VM
BzSpi committed Oct 5, 2023
1 parent 943346a commit 3eb6dce
Showing 17 changed files with 481 additions and 347 deletions.
4 changes: 2 additions & 2 deletions docs/severity.md
Original file line number Diff line number Diff line change
@@ -706,8 +706,8 @@
|Detector|Critical|Major|Minor|Warning|Info|
|---|---|---|---|---|---|
|Azure Virtual Machine heartbeat|X|-|-|-|-|
|Azure Virtual Machine CPU usage|X|X|-|-|-|
|Azure Virtual Machine remaining CPU credit|X|X|-|-|-|
|Azure Virtual Machine cpu|X|X|-|-|-|
|Azure Virtual Machine remaining cpu credit|X|X|-|-|-|


## integration_gcp-bigquery
4 changes: 2 additions & 2 deletions modules/integration_azure-virtual-machine-scaleset/README.md
@@ -58,7 +58,7 @@ Note the following parameters:

These 3 parameters along with all variables defined in [common-variables.tf](common-variables.tf) are common to all
[modules](../) in this repository. Other variables, specific to this module, are available in
[variables.tf](variables.tf).
[variables-gen.tf](variables-gen.tf).
In general, the default configuration "works" but all of these Terraform
[variables](https://www.terraform.io/language/values/variables) make it possible to
customize the detectors behavior to better fit your needs.
@@ -110,4 +110,4 @@ Next step will be to use signalFx outlier to, for example, check if all VMs in t
* [Terraform SignalFx provider](https://registry.terraform.io/providers/splunk-terraform/signalfx/latest/docs)
* [Terraform SignalFx detector](https://registry.terraform.io/providers/splunk-terraform/signalfx/latest/docs/resources/detector)
* [Splunk Observability integrations](https://docs.splunk.com/Observability/gdi/get-data-in/integrations.html)
* [Azure Monitor metrics](https://docs.microsoft.com/en-us/azure/azure-monitor/essentials/metrics-supported#microsoftcomputevirtualmachinescalesets)
* [Azure Monitor metrics](https://learn.microsoft.com/en-us/azure/azure-monitor/reference/supported-metrics/microsoft-compute-virtualmachinescalesets-metrics)
Original file line number Diff line number Diff line change
@@ -0,0 +1,15 @@
---
module: "Azure Virtual Machine ScaleSet"
name: heartbeat

transformation: true
aggregation: ".mean(by=['azure_resource_name', 'azure_resource_group_name', 'azure_region'])"

filtering: "filter('resource_type', 'Microsoft.Compute/virtualMachineScaleSets') and filter('primary_aggregation_type', 'true')"

signals:
  signal:
    metric: "Percentage CPU"
rules:
  critical:
...
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
documentations:
- name: Azure Monitor metrics
url: 'https://docs.microsoft.com/en-us/azure/azure-monitor/essentials/metrics-supported#microsoftcomputevirtualmachinescalesets'
url: 'https://learn.microsoft.com/en-us/azure/azure-monitor/reference/supported-metrics/microsoft-compute-virtualmachinescalesets-metrics'

notes: |
Unlike the VirtualMachines module, we decided not to monitor CPU on ScaleSet because it makes no sense on something that should autoscale automatically.
Original file line number Diff line number Diff line change
@@ -6,11 +6,11 @@ resource "signalfx_detector" "heartbeat" {
tags = compact(concat(local.common_tags, local.tags, var.extra_tags))

program_text = <<-EOF
from signalfx.detectors.not_reporting import not_reporting
base_filter = filter('resource_type', 'Microsoft.Compute/virtualMachineScaleSets') and filter('primary_aggregation_type', 'true')
signal = data('Percentage CPU', filter=base_filter and ${module.filtering.signalflow})${var.heartbeat_aggregation_function}.publish('signal')
not_reporting.detector(stream=signal, resource_identifier=None, duration='${var.heartbeat_timeframe}', auto_resolve_after='${local.heartbeat_auto_resolve_after}').publish('CRIT')
EOF
from signalfx.detectors.not_reporting import not_reporting
base_filtering = filter('resource_type', 'Microsoft.Compute/virtualMachineScaleSets') and filter('primary_aggregation_type', 'true')
signal = data('Percentage CPU', filter=base_filtering and ${module.filtering.signalflow})${var.heartbeat_aggregation_function}${var.heartbeat_transformation_function}.publish('signal')
not_reporting.detector(stream=signal, resource_identifier=None, duration='${var.heartbeat_timeframe}', auto_resolve_after='${local.heartbeat_auto_resolve_after}').publish('CRIT')
EOF

rule {
description = "has not reported in ${var.heartbeat_timeframe}"
@@ -26,3 +26,4 @@ resource "signalfx_detector" "heartbeat" {

max_delay = var.heartbeat_max_delay
}

Original file line number Diff line number Diff line change
@@ -1,6 +1,22 @@
# Module specific
# heartbeat detector

# Heartbeat detector
variable "heartbeat_notifications" {
description = "Notification recipients list per severity overridden for heartbeat detector"
type = map(list(string))
default = {}
}

variable "heartbeat_aggregation_function" {
description = "Aggregation function and group by for heartbeat detector (i.e. \".mean(by=['host'])\")"
type = string
default = ".mean(by=['azure_resource_name', 'azure_resource_group_name', 'azure_region'])"
}

variable "heartbeat_transformation_function" {
description = "Transformation function for heartbeat detector (i.e. \".mean(over='5m')\")"
type = string
default = ""
}

variable "heartbeat_max_delay" {
description = "Enforce max delay for heartbeat detector (use \"0\" or \"null\" for \"Auto\")"
@@ -26,20 +42,9 @@ variable "heartbeat_disabled" {
default = null
}

variable "heartbeat_notifications" {
description = "Notification recipients list per severity overridden for heartbeat detector"
type = map(list(string))
default = {}
}

variable "heartbeat_timeframe" {
description = "Timeframe for heartbeat detector (i.e. \"10m\")"
type = string
default = "10m"
}

variable "heartbeat_aggregation_function" {
description = "Aggregation function and group by for heartbeat detector (i.e. \".mean(by=['host'])\")"
type = string
default = ".mean(by=['azure_resource_id'])"
}
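
Reviewer note on the `heartbeat_aggregation_function` and `heartbeat_transformation_function` variables: both are plain string suffixes that Terraform appends, in that order, to the `data(...)` call in `program_text`. The Python sketch below only imitates that string interpolation to show the resulting SignalFlow line. It is not part of the module, and `EXTRA_FILTER` is a hypothetical stand-in for whatever `${module.filtering.signalflow}` renders to.

```python
def render_signal_line(aggregation: str, transformation: str = "") -> str:
    """Imitate the heredoc interpolation: aggregation first, then transformation."""
    # EXTRA_FILTER is a hypothetical placeholder for ${module.filtering.signalflow}.
    return (
        "signal = data('Percentage CPU', filter=base_filtering and EXTRA_FILTER)"
        f"{aggregation}{transformation}.publish('signal')"
    )


default_aggregation = (
    ".mean(by=['azure_resource_name', 'azure_resource_group_name', 'azure_region'])"
)

# Default transformation is the empty string, so only the aggregation is appended.
print(render_signal_line(default_aggregation))
# With the example value from the variable description.
print(render_signal_line(default_aggregation, ".mean(over='5m')"))
```

With the new default of `""` for the transformation, the rendered line is the same as before this commit apart from the new default aggregation and the `base_filtering` rename.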
9 changes: 5 additions & 4 deletions modules/integration_azure-virtual-machine/README.md
@@ -57,7 +57,7 @@ Note the following parameters:

These 3 parameters along with all variables defined in [common-variables.tf](common-variables.tf) are common to all
[modules](../) in this repository. Other variables, specific to this module, are available in
[variables.tf](variables.tf).
[variables-gen.tf](variables-gen.tf).
In general, the default configuration "works" but all of these Terraform
[variables](https://www.terraform.io/language/values/variables) make it possible to
customize the detectors behavior to better fit your needs.
@@ -76,8 +76,8 @@ This module creates the following SignalFx detectors which could contain one or
|Detector|Critical|Major|Minor|Warning|Info|
|---|---|---|---|---|---|
|Azure Virtual Machine heartbeat|X|-|-|-|-|
|Azure Virtual Machine CPU usage|X|X|-|-|-|
|Azure Virtual Machine remaining CPU credit|X|X|-|-|-|
|Azure Virtual Machine cpu|X|X|-|-|-|
|Azure Virtual Machine remaining cpu credit|X|X|-|-|-|

## How to collect required metrics?

@@ -97,6 +97,7 @@ Here is the list of required metrics for detectors in this module.

* `CPU Credits Consumed`
* `CPU Credits Remaining`
* `cpu_percent`
* `Percentage CPU`


@@ -107,4 +108,4 @@ Here is the list of required metrics for detectors in this module.
* [Terraform SignalFx provider](https://registry.terraform.io/providers/splunk-terraform/signalfx/latest/docs)
* [Terraform SignalFx detector](https://registry.terraform.io/providers/splunk-terraform/signalfx/latest/docs/resources/detector)
* [Splunk Observability integrations](https://docs.splunk.com/Observability/gdi/get-data-in/integrations.html)
* [Azure Monitor metrics](https://docs.microsoft.com/en-us/azure/azure-monitor/essentials/metrics-supported#microsoftcomputevirtualmachines)
* [Azure Monitor metrics](https://learn.microsoft.com/en-us/azure/azure-monitor/reference/supported-metrics/microsoft-compute-virtualmachines-metrics)
15 changes: 15 additions & 0 deletions modules/integration_azure-virtual-machine/conf/00-heartbeat.yaml
@@ -0,0 +1,15 @@
---
module: "Azure Virtual Machine"
name: heartbeat

transformation: true
aggregation: ".mean(by=['azure_resource_name', 'azure_resource_group_name', 'azure_region'])"

filtering: "filter('resource_type', 'Microsoft.Compute/virtualMachines') and filter('primary_aggregation_type', 'true') and (not filter('azure_power_state', 'PowerState/stopping', 'PowerState/stopped', 'PowerState/deallocating', 'PowerState/deallocated'))"

signals:
  signal:
    metric: "Percentage CPU"
rules:
  critical:
...
21 changes: 21 additions & 0 deletions modules/integration_azure-virtual-machine/conf/01-cpu.yaml
@@ -0,0 +1,21 @@
---
module: "Azure Virtual Machine"
name: "CPU"
filtering: "filter('resource_type', 'Microsoft.Compute/virtualMachines') and filter('primary_aggregation_type', 'true') and (not filter('azure_power_state', 'PowerState/stopping', 'PowerState/stopped', 'PowerState/deallocating', 'PowerState/deallocated'))"
aggregation: ".mean(by=['azure_resource_name', 'azure_resource_group_name', 'azure_region'])"
value_unit: "%"
transformation: true
signals:
  signal:
    metric: "cpu_percent"
rules:
  critical:
    threshold: 90
    comparator: ">"
    lasting_duration: '15m'
  major:
    threshold: 80
    comparator: ">"
    lasting_duration: '15m'
    dependency: critical
...
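
Reviewer note on `dependency: critical`: in the generated detector, the major rule fires only when its own condition holds and the critical condition does not, so a single value never raises both severities. The Python sketch below is a simplified illustration of that precedence using the thresholds from this file. It deliberately ignores `lasting_duration` and `at_least`, and the function name `classify_cpu` is hypothetical.

```python
from typing import Optional


def classify_cpu(value: float, critical: float = 90, major: float = 80) -> Optional[str]:
    """Return the detect label a single CPU value would map to, if any.

    Simplification: the real rules also require the condition to last
    for a duration; this sketch looks at one value only.
    """
    if value > critical:
        return "CRIT"
    # Major applies only when the critical condition is not met.
    if value > major:
        return "MAJOR"
    return None


for sample in (95, 85, 50):
    print(sample, classify_cpu(sample))
```

Both comparators are strict (`>`), so a value of exactly 90 falls under the major rule and a value of exactly 80 raises nothing.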
25 changes: 25 additions & 0 deletions modules/integration_azure-virtual-machine/conf/02-cpu-credit.yaml
@@ -0,0 +1,25 @@
---
module: "Azure Virtual Machine"
name: "remaining CPU credit"
filtering: "filter('resource_type', 'Microsoft.Compute/virtualMachines') and filter('primary_aggregation_type', 'true') and (not filter('azure_power_state', 'PowerState/stopping', 'PowerState/stopped', 'PowerState/deallocating', 'PowerState/deallocated'))"
aggregation: ".mean(by=['azure_resource_name', 'azure_resource_group_name', 'azure_region'])"
value_unit: "%"
transformation: true
signals:
  remaining:
    metric: "CPU Credits Remaining"
  consumed:
    metric: "CPU Credits Consumed"
  signal:
    formula: (remaining/(remaining+consumed)).scale(100).fill(100)
rules:
  critical:
    threshold: 15
    comparator: "<"
    lasting_duration: '5m'
  major:
    threshold: 30
    comparator: "<"
    lasting_duration: '5m'
    dependency: critical
...
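
Reviewer note on the formula `(remaining/(remaining+consumed)).scale(100).fill(100)`: it expresses the remaining credits as a percentage of remaining plus consumed, and substitutes 100 when no value is available, which keeps VMs that report no credit metrics from alerting. The Python sketch below works through that arithmetic for single values. The function name is hypothetical, and treating a zero denominator like missing data is an assumption of this sketch, not something the commit states.

```python
from typing import Optional


def remaining_credit_percent(
    remaining: Optional[float], consumed: Optional[float]
) -> float:
    """Single-value version of (remaining/(remaining+consumed)).scale(100).fill(100)."""
    # Missing input: mirror fill(100) by reporting a full credit balance.
    if remaining is None or consumed is None:
        return 100.0
    total = remaining + consumed
    # Assumption: a zero denominator is handled like missing data.
    if total == 0:
        return 100.0
    return 100.0 * remaining / total


print(remaining_credit_percent(30, 70))    # below neither threshold boundary: equals the major threshold
print(remaining_credit_percent(10, 90))    # below the critical threshold of 15
print(remaining_credit_percent(None, 5))   # missing data is filled with 100
```

With the thresholds above, 10 credits remaining against 90 consumed gives 10%, which is under the critical threshold of 15.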
2 changes: 1 addition & 1 deletion modules/integration_azure-virtual-machine/conf/readme.yaml
@@ -1,3 +1,3 @@
documentations:
- name: Azure Monitor metrics
url: 'https://docs.microsoft.com/en-us/azure/azure-monitor/essentials/metrics-supported#microsoftcomputevirtualmachines'
url: 'https://learn.microsoft.com/en-us/azure/azure-monitor/reference/supported-metrics/microsoft-compute-virtualmachines-metrics'
123 changes: 123 additions & 0 deletions modules/integration_azure-virtual-machine/detectors-gen.tf
@@ -0,0 +1,123 @@
resource "signalfx_detector" "heartbeat" {
name = format("%s %s", local.detector_name_prefix, "Azure Virtual Machine heartbeat")

authorized_writer_teams = var.authorized_writer_teams
teams = try(coalescelist(var.teams, var.authorized_writer_teams), null)
tags = compact(concat(local.common_tags, local.tags, var.extra_tags))

program_text = <<-EOF
from signalfx.detectors.not_reporting import not_reporting
base_filtering = filter('resource_type', 'Microsoft.Compute/virtualMachines') and filter('primary_aggregation_type', 'true') and (not filter('azure_power_state', 'PowerState/stopping', 'PowerState/stopped', 'PowerState/deallocating', 'PowerState/deallocated'))
signal = data('Percentage CPU', filter=base_filtering and ${module.filtering.signalflow})${var.heartbeat_aggregation_function}${var.heartbeat_transformation_function}.publish('signal')
not_reporting.detector(stream=signal, resource_identifier=None, duration='${var.heartbeat_timeframe}', auto_resolve_after='${local.heartbeat_auto_resolve_after}').publish('CRIT')
EOF

rule {
description = "has not reported in ${var.heartbeat_timeframe}"
severity = "Critical"
detect_label = "CRIT"
disabled = coalesce(var.heartbeat_disabled, var.detectors_disabled)
notifications = try(coalescelist(lookup(var.heartbeat_notifications, "critical", []), var.notifications.critical), null)
runbook_url = try(coalesce(var.heartbeat_runbook_url, var.runbook_url), "")
tip = var.heartbeat_tip
parameterized_subject = var.message_subject == "" ? local.rule_subject_novalue : var.message_subject
parameterized_body = var.message_body == "" ? local.rule_body : var.message_body
}

max_delay = var.heartbeat_max_delay
}

resource "signalfx_detector" "cpu" {
name = format("%s %s", local.detector_name_prefix, "Azure Virtual Machine cpu")

authorized_writer_teams = var.authorized_writer_teams
teams = try(coalescelist(var.teams, var.authorized_writer_teams), null)
tags = compact(concat(local.common_tags, local.tags, var.extra_tags))

viz_options {
label = "signal"
value_suffix = "%"
}

program_text = <<-EOF
base_filtering = filter('resource_type', 'Microsoft.Compute/virtualMachines') and filter('primary_aggregation_type', 'true') and (not filter('azure_power_state', 'PowerState/stopping', 'PowerState/stopped', 'PowerState/deallocating', 'PowerState/deallocated'))
signal = data('cpu_percent', filter=base_filtering and ${module.filtering.signalflow})${var.cpu_aggregation_function}${var.cpu_transformation_function}.publish('signal')
detect(when(signal > ${var.cpu_threshold_critical}, lasting=%{if var.cpu_lasting_duration_critical == null}None%{else}'${var.cpu_lasting_duration_critical}'%{endif}, at_least=${var.cpu_at_least_percentage_critical})).publish('CRIT')
detect(when(signal > ${var.cpu_threshold_major}, lasting=%{if var.cpu_lasting_duration_major == null}None%{else}'${var.cpu_lasting_duration_major}'%{endif}, at_least=${var.cpu_at_least_percentage_major}) and (not when(signal > ${var.cpu_threshold_critical}, lasting=%{if var.cpu_lasting_duration_critical == null}None%{else}'${var.cpu_lasting_duration_critical}'%{endif}, at_least=${var.cpu_at_least_percentage_critical}))).publish('MAJOR')
EOF

rule {
description = "is too high > ${var.cpu_threshold_critical}%"
severity = "Critical"
detect_label = "CRIT"
disabled = coalesce(var.cpu_disabled_critical, var.cpu_disabled, var.detectors_disabled)
notifications = try(coalescelist(lookup(var.cpu_notifications, "critical", []), var.notifications.critical), null)
runbook_url = try(coalesce(var.cpu_runbook_url, var.runbook_url), "")
tip = var.cpu_tip
parameterized_subject = var.message_subject == "" ? local.rule_subject : var.message_subject
parameterized_body = var.message_body == "" ? local.rule_body : var.message_body
}

rule {
description = "is too high > ${var.cpu_threshold_major}%"
severity = "Major"
detect_label = "MAJOR"
disabled = coalesce(var.cpu_disabled_major, var.cpu_disabled, var.detectors_disabled)
notifications = try(coalescelist(lookup(var.cpu_notifications, "major", []), var.notifications.major), null)
runbook_url = try(coalesce(var.cpu_runbook_url, var.runbook_url), "")
tip = var.cpu_tip
parameterized_subject = var.message_subject == "" ? local.rule_subject : var.message_subject
parameterized_body = var.message_body == "" ? local.rule_body : var.message_body
}

max_delay = var.cpu_max_delay
}

resource "signalfx_detector" "remaining_cpu_credit" {
name = format("%s %s", local.detector_name_prefix, "Azure Virtual Machine remaining cpu credit")

authorized_writer_teams = var.authorized_writer_teams
teams = try(coalescelist(var.teams, var.authorized_writer_teams), null)
tags = compact(concat(local.common_tags, local.tags, var.extra_tags))

viz_options {
label = "signal"
value_suffix = "%"
}

program_text = <<-EOF
base_filtering = filter('resource_type', 'Microsoft.Compute/virtualMachines') and filter('primary_aggregation_type', 'true') and (not filter('azure_power_state', 'PowerState/stopping', 'PowerState/stopped', 'PowerState/deallocating', 'PowerState/deallocated'))
remaining = data('CPU Credits Remaining', filter=base_filtering and ${module.filtering.signalflow})${var.remaining_cpu_credit_aggregation_function}${var.remaining_cpu_credit_transformation_function}
consumed = data('CPU Credits Consumed', filter=base_filtering and ${module.filtering.signalflow})${var.remaining_cpu_credit_aggregation_function}${var.remaining_cpu_credit_transformation_function}
signal = (remaining/(remaining+consumed)).scale(100).fill(100).publish('signal')
detect(when(signal < ${var.remaining_cpu_credit_threshold_critical}, lasting=%{if var.remaining_cpu_credit_lasting_duration_critical == null}None%{else}'${var.remaining_cpu_credit_lasting_duration_critical}'%{endif}, at_least=${var.remaining_cpu_credit_at_least_percentage_critical})).publish('CRIT')
detect(when(signal < ${var.remaining_cpu_credit_threshold_major}, lasting=%{if var.remaining_cpu_credit_lasting_duration_major == null}None%{else}'${var.remaining_cpu_credit_lasting_duration_major}'%{endif}, at_least=${var.remaining_cpu_credit_at_least_percentage_major}) and (not when(signal < ${var.remaining_cpu_credit_threshold_critical}, lasting=%{if var.remaining_cpu_credit_lasting_duration_critical == null}None%{else}'${var.remaining_cpu_credit_lasting_duration_critical}'%{endif}, at_least=${var.remaining_cpu_credit_at_least_percentage_critical}))).publish('MAJOR')
EOF

rule {
description = "is too low < ${var.remaining_cpu_credit_threshold_critical}%"
severity = "Critical"
detect_label = "CRIT"
disabled = coalesce(var.remaining_cpu_credit_disabled_critical, var.remaining_cpu_credit_disabled, var.detectors_disabled)
notifications = try(coalescelist(lookup(var.remaining_cpu_credit_notifications, "critical", []), var.notifications.critical), null)
runbook_url = try(coalesce(var.remaining_cpu_credit_runbook_url, var.runbook_url), "")
tip = var.remaining_cpu_credit_tip
parameterized_subject = var.message_subject == "" ? local.rule_subject : var.message_subject
parameterized_body = var.message_body == "" ? local.rule_body : var.message_body
}

rule {
description = "is too low < ${var.remaining_cpu_credit_threshold_major}%"
severity = "Major"
detect_label = "MAJOR"
disabled = coalesce(var.remaining_cpu_credit_disabled_major, var.remaining_cpu_credit_disabled, var.detectors_disabled)
notifications = try(coalescelist(lookup(var.remaining_cpu_credit_notifications, "major", []), var.notifications.major), null)
runbook_url = try(coalesce(var.remaining_cpu_credit_runbook_url, var.runbook_url), "")
tip = var.remaining_cpu_credit_tip
parameterized_subject = var.message_subject == "" ? local.rule_subject : var.message_subject
parameterized_body = var.message_body == "" ? local.rule_body : var.message_body
}

max_delay = var.remaining_cpu_credit_max_delay
}
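
Reviewer note on the `%{if ... == null}None%{else}'...'%{endif}` directives used in the `detect(...)` lines above: they render the `lasting` argument as the bare SignalFlow literal `None` when the Terraform variable is null, and as a quoted duration string otherwise. The Python sketch below imitates that rendering for one rule. The function name and the `at_least` value of `1` are hypothetical, since the variable defaults are not shown in this diff.

```python
from typing import Optional


def render_detect(
    threshold: float,
    lasting: Optional[str],
    at_least: float,
    comparator: str = ">",
    label: str = "CRIT",
) -> str:
    """Imitate how the Terraform template renders one detect() line."""
    # null in Terraform becomes the unquoted literal None; anything else is quoted.
    lasting_arg = "None" if lasting is None else f"'{lasting}'"
    return (
        f"detect(when(signal {comparator} {threshold}, "
        f"lasting={lasting_arg}, at_least={at_least})).publish('{label}')"
    )


print(render_detect(90, "15m", 1))
print(render_detect(90, None, 1))
```

The quoting matters because an unquoted `15m` would not be valid SignalFlow, while a quoted `'None'` would be read as a string rather than as the absence of a duration.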
