Repetitive editing of alerts, dashboards

At times one needs to change multiple alerts, or queries in several dashboards, all at once. A renamed metric is a common cause: for example, kube-state-metrics, the metric generator in Kubernetes, occasionally changes a metric's name.

After such a change, the platform owner needs to update every alert and dashboard that uses that metric. Searching for the metric in the Wavefront GUI and updating the queries manually is labor-intensive, error-prone, and time-consuming. wavectl can help automate such a global change.

For the sake of an example, let's say that because of an upstream change, all metrics that started with proc. have been renamed to start with host.proc.. Once this upstream change gets deployed, numerous alerts and dashboards will be broken: they will query the old metric name and show no data. To fix this quickly via wavectl, we first pull all alerts and dashboards that match the proc\. regular expression. The --match option narrows down the returned set via a regular expression search.

  $ wavectl pull /tmp/RepetitiveEditing/alerts alert --match "proc\."

  $ wavectl pull /tmp/RepetitiveEditing/dashboards dashboard --match "proc\."

See the pulled alerts and dashboards:

  $ find /tmp/RepetitiveEditing -type f
  /tmp/RepetitiveEditing/alerts/1530723441304.alert
  /tmp/RepetitiveEditing/alerts/1530723441442.alert
  /tmp/RepetitiveEditing/alerts/1530723441589.alert
  /tmp/RepetitiveEditing/alerts/1530723443146.alert
  /tmp/RepetitiveEditing/dashboards/metadata-perfpod.dashboard
  /tmp/RepetitiveEditing/dashboards/octoproxy.dashboard

See the usage of metrics starting with proc. in the pulled alerts and dashboards:

  $ find /tmp/RepetitiveEditing -type f | xargs grep "proc\."
  /tmp/RepetitiveEditing/alerts/1530723441304.alert:    "condition": "ts(proc.net.percent,server_type=\"compute-*\" and env=\"live\") > 80",
  /tmp/RepetitiveEditing/alerts/1530723441304.alert:    "displayExpression": "ts(proc.net.percent,server_type=\"compute-*\" and env=\"live\")",
  /tmp/RepetitiveEditing/alerts/1530723441442.alert:    "condition": "ts(proc.stat.cpu.percentage_used,server_type=\"compute-*\" and env=\"live\") > 80",
  /tmp/RepetitiveEditing/alerts/1530723441442.alert:    "displayExpression": "ts(proc.stat.cpu.percentage_used,server_type=\"compute-*\" and env=\"dev\")",
  /tmp/RepetitiveEditing/alerts/1530723441589.alert:    "condition": "ts(proc.meminfo.percentage_swapused,server_type=\"compute-*\" and env=\"live\") > 10",
  /tmp/RepetitiveEditing/alerts/1530723441589.alert:    "displayExpression": "ts(proc.meminfo.percentage_swapused,server_type=\"compute-*\" and env=\"live\")",
  /tmp/RepetitiveEditing/alerts/1530723443146.alert:    "condition": "max(((sum(rate(ts(proc.stat.cpu, namespace=\"collections-service-dev\"  and type=used)), pod_name) /100) / (sum(taggify(ts(kube.metrics.pod_container_resource_requests_cpu_cores, namespace=\"collections-service-dev\"),  tagk, pod, pod_name, \"\", \"\"), pod_name)) * 100)) > 70",
  /tmp/RepetitiveEditing/alerts/1530723443146.alert:    "displayExpression": "(sum(rate(ts(proc.stat.cpu, namespace=\"collections-service-dev\" and type=used)), pod_name) /100) / (sum(taggify(ts(kube.metrics.pod_container_resource_requests_cpu_cores, namespace=\"collections-service-dev\" ),  tagk, pod, pod_name, \"\", \"\"), pod_name)) * 100",
  /tmp/RepetitiveEditing/dashboards/metadata-perfpod.dashboard:                                    "query": "ts(\"proc.kernel.entropy_avail\", host=${metadata_server})",
  /tmp/RepetitiveEditing/dashboards/metadata-perfpod.dashboard:                                    "query": "ts(proc.stat.cpu.percentage_used, ${PerfPod} and host=${metadata_server})",
  /tmp/RepetitiveEditing/dashboards/metadata-perfpod.dashboard:                                    "query": "ts(proc.stat.cpu.percentage_used, ${PerfPod} and host=${metadata_db02})",
  /tmp/RepetitiveEditing/dashboards/metadata-perfpod.dashboard:                                    "query": "ts(proc.stat.cpu.percentage_used, ${PerfPod} and host=${metadata_database})",
  /tmp/RepetitiveEditing/dashboards/metadata-perfpod.dashboard:                                    "query": "ts(proc.stat.cpu.percentage_used, ${PerfPod} and host=${eng01})",
  /tmp/RepetitiveEditing/dashboards/metadata-perfpod.dashboard:                                    "query": "ts(proc.stat.cpu.percentage_used, ${PerfPod} and host=${content01})",
  /tmp/RepetitiveEditing/dashboards/metadata-perfpod.dashboard:                                    "query": "ts(proc.stat.cpu.percentage_iowait, ${PerfPod} and host=${metadata_server})",
  ...

Then, using sed, replace all occurrences of proc. with host.proc.:

  $ find /tmp/RepetitiveEditing -type f | xargs sed -i -e 's/proc\./host.proc./g'
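
If you would rather preview the substitution before modifying any files, you can diff sed's output against each original first. Below is a minimal sketch of that idea, not part of the wavectl workflow itself; it assumes GNU sed and diffutils. (Note that on macOS/BSD, sed -i additionally requires a backup-suffix argument, e.g. sed -i ''.)

  $ find /tmp/RepetitiveEditing -type f | while read f; do sed -e 's/proc\./host.proc./g' "$f" | diff -u "$f" -; done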

Check the changes you have made:

  $ find /tmp/RepetitiveEditing -type f | xargs grep "host\.proc\."
  /tmp/RepetitiveEditing/alerts/1530723441304.alert:    "condition": "ts(host.proc.net.percent,server_type=\"compute-*\" and env=\"live\") > 80",
  /tmp/RepetitiveEditing/alerts/1530723441304.alert:    "displayExpression": "ts(host.proc.net.percent,server_type=\"compute-*\" and env=\"live\")",
  /tmp/RepetitiveEditing/alerts/1530723441442.alert:    "condition": "ts(host.proc.stat.cpu.percentage_used,server_type=\"compute-*\" and env=\"live\") > 80",
  /tmp/RepetitiveEditing/alerts/1530723441442.alert:    "displayExpression": "ts(host.proc.stat.cpu.percentage_used,server_type=\"compute-*\" and env=\"dev\")",
  /tmp/RepetitiveEditing/alerts/1530723441589.alert:    "condition": "ts(host.proc.meminfo.percentage_swapused,server_type=\"compute-*\" and env=\"live\") > 10",
  /tmp/RepetitiveEditing/alerts/1530723441589.alert:    "displayExpression": "ts(host.proc.meminfo.percentage_swapused,server_type=\"compute-*\" and env=\"live\")",
  /tmp/RepetitiveEditing/alerts/1530723443146.alert:    "condition": "max(((sum(rate(ts(host.proc.stat.cpu, namespace=\"collections-service-dev\"  and type=used)), pod_name) /100) / (sum(taggify(ts(kube.metrics.pod_container_resource_requests_cpu_cores, namespace=\"collections-service-dev\"),  tagk, pod, pod_name, \"\", \"\"), pod_name)) * 100)) > 70",
  /tmp/RepetitiveEditing/alerts/1530723443146.alert:    "displayExpression": "(sum(rate(ts(host.proc.stat.cpu, namespace=\"collections-service-dev\" and type=used)), pod_name) /100) / (sum(taggify(ts(kube.metrics.pod_container_resource_requests_cpu_cores, namespace=\"collections-service-dev\" ),  tagk, pod, pod_name, \"\", \"\"), pod_name)) * 100",
  /tmp/RepetitiveEditing/dashboards/metadata-perfpod.dashboard:                                    "query": "ts(\"host.proc.kernel.entropy_avail\", host=${metadata_server})",
  /tmp/RepetitiveEditing/dashboards/metadata-perfpod.dashboard:                                    "query": "ts(host.proc.stat.cpu.percentage_used, ${PerfPod} and host=${metadata_server})",
  /tmp/RepetitiveEditing/dashboards/metadata-perfpod.dashboard:                                    "query": "ts(host.proc.stat.cpu.percentage_used, ${PerfPod} and host=${metadata_db02})",
  /tmp/RepetitiveEditing/dashboards/metadata-perfpod.dashboard:                                    "query": "ts(host.proc.stat.cpu.percentage_used, ${PerfPod} and host=${metadata_database})",
  /tmp/RepetitiveEditing/dashboards/metadata-perfpod.dashboard:                                    "query": "ts(host.proc.stat.cpu.percentage_used, ${PerfPod} and host=${eng01})",
  /tmp/RepetitiveEditing/dashboards/metadata-perfpod.dashboard:                                    "query": "ts(host.proc.stat.cpu.percentage_used, ${PerfPod} and host=${content01})",
  /tmp/RepetitiveEditing/dashboards/metadata-perfpod.dashboard:                                    "query": "ts(host.proc.stat.cpu.percentage_iowait, ${PerfPod} and host=${metadata_server})",
  ...

Replace the Wavefront alerts and dashboards using wavectl push:

  $ wavectl push /tmp/RepetitiveEditing/alerts alert
  Replaced alert(s):
  ID               NAME                                                       STATUS    SEVERITY    
  1530723441304    Kubernetes - Node Network Utilization - HIGH (Prod)            WARN    
  1530723441442    Kubernetes - Node Cpu Utilization - HIGH (Prod)                WARN    
  1530723441589    Kubernetes - Node Memory Swap Utilization - HIGH (Prod)        WARN    
  1530723443146    Collections Dev High CPU                                       WARN

  $ wavectl push /tmp/RepetitiveEditing/dashboards dashboard
  Replaced dashboard(s):
  ID                  NAME                DESCRIPTION                                      
  metadata-perfpod    Metadata PerfPod    Monitors for testing Metadata in the PerfPods    
  octoproxy           Skynet Octoproxy    One look summary about the load balancer

After these steps all your alerts and dashboards in Wavefront will use the new metric names.

NOTE: Making local modifications via sed-like commands and writing the resulting files back to Wavefront can be risky. Unintended changes may get written to Wavefront by mistake. If you want to execute safer local modifications, where you have a better handle on the resulting diff, take a look at the git integration to push command section.
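
As a rough sketch of that diff-based safety net, assuming git is available, you can commit the pulled files before editing and review git diff before pushing:

  $ cd /tmp/RepetitiveEditing
  $ git init -q && git add -A && git commit -q -m "pulled state before rename"
  $ find . -type f -not -path "./.git/*" | xargs sed -i -e 's/proc\./host.proc./g'
  $ git diff    # review every change before running wavectl push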