Commit 62e183f

docs: add documentation about setting up alarms
Signed-off-by: Guillaume <guillaume.thouvenin@vates.tech>
1 parent 9903379 commit 62e183f

1 file changed, +217 -0 lines changed

doc/content/xapi/alarms/index.md

Lines changed: 217 additions & 0 deletions
@@ -0,0 +1,217 @@
+++
title = "How to set up alarms"
linkTitle = "Alarms"
+++

# Introduction

In XAPI, alarms are triggered by a Python daemon located at `/opt/xensource/bin/perfmon`.
The daemon is managed as a systemd service and can be configured by setting parameters in `/etc/sysconfig/perfmon`.

It listens on an internal Unix socket to receive commands. Otherwise, it runs in a loop, periodically requesting metrics from XAPI. It can then be configured to generate events based on these metrics. It can monitor various types of XAPI objects, including `VMs`, `SRs`, and `Hosts`. The configuration for each object is defined by writing an XML string into the `perfmon` key of the object's `other-config` field.

The metrics used by `perfmon` are collected by the `xcp-rrdd` daemon. The `xcp-rrdd` daemon is a component of XAPI responsible for collecting metrics and storing them as Round-Robin Databases (RRDs).

A XAPI plugin named `perfmon` also exists, providing the functions `stop`, `start`, `restart`, `refresh`, and `debug_mem`. The `stop`, `start`, and `restart` functions appear to be deprecated, as they do not use systemd to manage the state of the `perfmon` daemon.
The `refresh` and `debug_mem` functions send commands through the Unix socket. `refresh` is used when an `other-config` key is added or updated; it triggers the daemon to reread the monitored objects so that new alerts are taken into account. `debug_mem` logs the objects currently being monitored into `/var/log/user.log` as a dictionary.

# Monitoring and alarms

## Overview

- To get the metrics, `perfmon` queries XAPI by requesting a URL such as: `http://localhost/rrd_updates?session_id=<ref>&start=1759912021&host=true&sr_uuid=all&cf=AVERAGE&interval=60`
- Different consolidation functions can be used, such as **AVERAGE**, **MIN**, **MAX** or **LAST**. See the following sections for details on specific objects and how to set the function.
- Once the metrics are retrieved, `perfmon` checks all its triggers and generates alarms if needed.
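
To make the shape of that request easier to see, here is a minimal sketch that assembles the same query string. The helper name `build_rrd_updates_url` and its defaults are assumptions made for illustration; only the parameter names and values are taken from the example URL above.

```python
from urllib.parse import urlencode


def build_rrd_updates_url(session_ref, start, cf="AVERAGE", interval=60,
                          host=True, sr_uuid="all", base="http://localhost"):
    """Return an rrd_updates URL shaped like the example in the overview."""
    # Keep the parameters in the same order as the documented example.
    params = [
        ("session_id", session_ref),
        ("start", int(start)),  # Unix timestamp of the first sample wanted
        ("host", "true" if host else "false"),
        ("sr_uuid", sr_uuid),
        ("cf", cf),  # consolidation function: AVERAGE, MIN, MAX or LAST
        ("interval", int(interval)),
    ]
    return f"{base}/rrd_updates?{urlencode(params)}"


# "OpaqueRef:example" is a placeholder, not a real session reference.
print(build_rrd_updates_url("OpaqueRef:example", 1759912021, cf="MAX"))
```

Note that `urlencode` percent-encodes characters such as `:` in the session reference.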

## Specific XAPI objects
### VMs

- To set an alarm on a VM, you need to write an XML string into the `other-config` key of the object. For example, to trigger an alarm when the CPU usage is higher than 50%, run:
```sh
xe vm-param-set uuid=<UUID> other-config:perfmon='<config> <variable> <name value="cpu_usage"/> <alarm_trigger_level value="0.5"/> </variable> </config>'
```

- Then, you can either wait until the new config is read by the `perfmon` daemon or force a refresh by running:
```sh
xe host-call-plugin host-uuid=<UUID> plugin=perfmon fn=refresh
```
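
The same two steps can presumably be done through the XenAPI Python bindings instead of `xe`. The sketch below is an assumption-laden illustration, not something taken from the `perfmon` sources: `set_vm_alarm_config` is a hypothetical helper, and it expects an already logged-in session object such as the one created in the script at the end of this page.

```python
PERFMON_KEY = "perfmon"


def set_vm_alarm_config(session, vm_uuid, host_uuid, config_xml):
    """Store config_xml under other-config:perfmon, then ask perfmon to refresh.

    `session` is assumed to be a logged-in XenAPI session.
    """
    vm_ref = session.xenapi.VM.get_by_uuid(vm_uuid)
    # add_to_other_config rejects a key that already exists, so drop it first.
    session.xenapi.VM.remove_from_other_config(vm_ref, PERFMON_KEY)
    session.xenapi.VM.add_to_other_config(vm_ref, PERFMON_KEY, config_xml)
    # Equivalent of: xe host-call-plugin host-uuid=... plugin=perfmon fn=refresh
    host_ref = session.xenapi.host.get_by_uuid(host_uuid)
    return session.xenapi.host.call_plugin(host_ref, "perfmon", "refresh", {})
```

The helper only wraps the calls; error handling (for example catching `XenAPI.Failure`) is left to the caller.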

- Now, if you generate some load inside the VM and the CPU usage goes above 50%, the `perfmon` daemon will create a message (a XAPI object) with the name **ALARM**. This message will include a _priority_, a _timestamp_, an _obj-uuid_ and a _body_. To list all messages that are alarms, run:
```sh
xe message-list name=ALARM
```

- You will see, for example:
```sh
uuid ( RO) : dadd7cbc-cb4e-5a56-eb0b-0bb31c102c94
name ( RO): ALARM
priority ( RO): 3
class ( RO): VM
obj-uuid ( RO): ea9efde2-d0f2-34bb-74cb-78c303f65d89
timestamp ( RO): 20251007T11:30:26Z
body ( RO): value: 0.986414
config:
<variable>

<name value="cpu_usage"/>

<alarm_trigger_level value="0.5"/>

</variable>
```
- where the _body_ contains all the relevant information: the value that triggered the alarm and the configuration of your alarm.
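
If the alarm needs to be processed by a script, the _body_ can be split into its two parts. The following is a minimal sketch that assumes the body always has the layout shown in the example above (a `value:` line, a `config:` line, then a single `<variable>` element); `parse_alarm_body` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def parse_alarm_body(body):
    """Split an ALARM body into (value, {child_tag: value_attribute})."""
    # Everything before "config:" holds the value, everything after is XML.
    head, _, config_xml = body.partition("config:")
    value = None
    for line in head.splitlines():
        if line.strip().startswith("value:"):
            value = float(line.split(":", 1)[1])
    variable = ET.fromstring(config_xml.strip())
    settings = {child.tag: child.get("value") for child in variable}
    return value, settings


# Body copied from the example message above.
sample_body = """value: 0.986414
config:
<variable>

<name value="cpu_usage"/>

<alarm_trigger_level value="0.5"/>

</variable>"""

value, settings = parse_alarm_body(sample_body)
print(value, settings)
```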

- When configuring your alarm, your XML string can:
  - have multiple `<variable>` nodes
  - use the following values for child nodes:
    * **name**: what to call the variable (no default)
    * **alarm_priority**: the priority of the messages generated (default '3')
    * **alarm_trigger_level**: level of value that triggers an alarm (no default)
    * **alarm_trigger_sense**: 'high' if alarm_trigger_level is a max, otherwise 'low' (default 'high')
    * **alarm_trigger_period**: number of seconds of 'bad' values before an alarm is sent (default '60')
    * **alarm_auto_inhibit_period**: number of seconds this alarm is disabled after an alarm is sent (default '3600')
    * **consolidation_fn**: how to combine variables from rrd_updates into one value (default is 'average' for 'cpu_usage', 'get_percent_fs_usage' for 'fs_usage', 'get_percent_log_fs_usage' for 'log_fs_usage', 'get_percent_mem_usage' for 'mem_usage', and 'sum' for everything else)
    * **rrd_regex**: matches the names of variables from (`xe vm-data-source-list uuid=$vmuuid`) used to compute value (only has defaults for "cpu_usage", "network_usage", and "disk_usage")
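
Writing the XML by hand is error-prone once several child nodes are involved. As a rough sketch, the string could be generated from a dictionary instead. The helper `build_perfmon_config` is hypothetical, and treating the two "no default" entries (`name` and `alarm_trigger_level`) as mandatory is an assumption of this example.

```python
import xml.etree.ElementTree as ET

# Child node names taken from the list above.
ALLOWED_CHILDREN = (
    "name", "alarm_priority", "alarm_trigger_level", "alarm_trigger_sense",
    "alarm_trigger_period", "alarm_auto_inhibit_period", "consolidation_fn",
    "rrd_regex",
)


def build_perfmon_config(variables):
    """Build a <config> string from a list of dicts, one dict per <variable>."""
    config = ET.Element("config")
    for settings in variables:
        # Assumption: entries documented with "no default" must be given.
        if "name" not in settings or "alarm_trigger_level" not in settings:
            raise ValueError("name and alarm_trigger_level have no default")
        variable = ET.SubElement(config, "variable")
        for tag, value in settings.items():
            if tag not in ALLOWED_CHILDREN:
                raise ValueError(f"unknown child node: {tag}")
            ET.SubElement(variable, tag, {"value": str(value)})
    return ET.tostring(config, encoding="unicode")


config_xml = build_perfmon_config([
    {"name": "cpu_usage", "alarm_trigger_level": 0.5, "alarm_trigger_period": 120},
])
print(config_xml)
```

The resulting string can then be passed as the value of `other-config:perfmon`.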

### SRs

- To set an alarm on an SR object, as with VMs, you need to write an XML string into the `other-config` key of the SR. For example, you can run:
```sh
xe sr-param-set uuid=<UUID> other-config:perfmon='<config><variable><name value="physical_utilisation"/><alarm_trigger_level value="0.8"/></variable></config>'
```
- When configuring your alarm, the XML string supports the same child elements as for VMs.

### Hosts

- As with VMs and SRs, alarms can be configured by writing an XML string into an `other-config` key. For example, you can run:
```sh
xe host-param-set uuid=<UUID> other-config:perfmon=\
'<config><variable><name value="cpu_usage"/><alarm_trigger_level value="0.5"/></variable></config>'
```

- The XML string can include multiple `<variable>` nodes.
- The full list of supported child nodes is:
  * **name**: what to call the variable (no default)
  * **alarm_priority**: the priority of the messages generated (default '3')
  * **alarm_trigger_level**: level of value that triggers an alarm (no default)
  * **alarm_trigger_sense**: 'high' if alarm_trigger_level is a max, otherwise 'low' (default 'high')
  * **alarm_trigger_period**: number of seconds of 'bad' values before an alarm is sent (default '60')
  * **alarm_auto_inhibit_period**: number of seconds this alarm is disabled after an alarm is sent (default '3600')
  * **consolidation_fn**: how to combine variables from rrd_updates into one value (default is 'average' for 'cpu_usage' and 'sum' for everything else)
  * **rrd_regex**: matches the names of variables from (`xe host-data-source-list uuid=<UUID>`) used to compute value (only has defaults for "cpu_usage", "network_usage", "memory_free_kib" and "sr_io_throughput_total_xxxxxxxx", where the last one ends with the first eight characters of the SR UUID)
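
To illustrate how `alarm_trigger_sense`, `alarm_trigger_period` and `alarm_auto_inhibit_period` interact, here is a simplified model written only from the descriptions in the list above. It is not the `perfmon` implementation: the class `AlarmSketch`, the strict comparisons and the exact boundary behaviour are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlarmSketch:
    trigger_level: float
    trigger_sense: str = "high"          # 'high': level is a max, 'low': a min
    trigger_period: float = 60.0         # seconds of 'bad' values required
    auto_inhibit_period: float = 3600.0  # seconds of silence after an alarm
    bad_since: Optional[float] = None
    last_sent: Optional[float] = None

    def is_bad(self, value):
        if self.trigger_sense == "high":
            return value > self.trigger_level
        return value < self.trigger_level

    def update(self, value, now):
        """Feed one consolidated sample; return True when an alarm would be sent."""
        if not self.is_bad(value):
            self.bad_since = None  # a good value resets the 'bad' streak
            return False
        if self.bad_since is None:
            self.bad_since = now
        if now - self.bad_since < self.trigger_period:
            return False  # not bad for long enough yet
        if self.last_sent is not None and now - self.last_sent < self.auto_inhibit_period:
            return False  # still inhibited by the previous alarm
        self.last_sent = now
        return True


alarm = AlarmSketch(trigger_level=0.5)
# Constant high load sampled at t = 0 s, 60 s, 120 s and 3660 s.
results = [alarm.update(0.9, t) for t in (0, 60, 120, 3660)]
print(results)
```

In this model, a value must stay 'bad' for the whole trigger period before the first alarm, and further alarms are suppressed until the inhibit period has elapsed.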

- As a special case for SR throughput, it is also possible to configure a Host by writing XML into the `other-config` key of an SR connected to it. For example:
```sh
xe sr-param-set uuid=$sruuid other-config:perfmon=\
'<config><variable><name value="sr_io_throughput_total_per_host"/><alarm_trigger_level value="0.01"/></variable></config>'
```
- This only works for that specific variable name, and `rrd_regex` must not be specified.
- Configuration done directly on the host (with the variable name `sr_io_throughput_total_xxxxxxxx`) takes priority.

## Which metrics are available?

- Accepted names for metrics are:
  - **cpu_usage**: matches RRD metrics with the pattern `cpu[0-9]+`
  - **network_usage**: matches RRD metrics with the pattern `vif_[0-9]+_[rt]x`
  - **disk_usage**: matches RRD metrics with the pattern `vbd_(xvd|hd)[a-z]+_(read|write)`
  - **fs_usage**, **log_fs_usage**, **mem_usage** and **memory_internal_free** do not match anything by default.
- By using `rrd_regex`, you can add your own expressions. To get a list of available metrics with their descriptions, you can call the `get_data_sources` method for [VM](https://xapi-project.github.io/new-docs/xen-api/classes/vm/), for [SR](https://xapi-project.github.io/new-docs/xen-api/classes/sr/) and also for [Host](https://xapi-project.github.io/new-docs/xen-api/classes/host/).
- A Python script is provided at the end of this page to get data sources. Using the script, we can see, for example:
```sh
# ./get_data_sources.py --vm 5a445deb-0a8e-c6fe-24c8-09a0508bbe21

List of data sources related to VM 5a445deb-0a8e-c6fe-24c8-09a0508bbe21
cpu0 | CPU0 usage
cpu_usage | Domain CPU usage
memory | Memory currently allocated to VM
memory_internal_free | Memory used as reported by the guest agent
memory_target | Target of VM balloon driver
...
vbd_xvda_io_throughput_read | Data read from the VDI, in MiB/s
...
```
- You can then set up an alarm when the data read from a VDI exceeds a certain level by doing:
```sh
xe vm-param-set uuid=5a445deb-0a8e-c6fe-24c8-09a0508bbe21 \
other-config:perfmon='<config><variable>
<name value="disk_usage"/>
<alarm_trigger_level value="10"/>
<rrd_regex value="vbd_xvda_io_throughput_read"/>
</variable> </config>'
```
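
Before writing an `rrd_regex`, it can help to check which data source names a pattern would select. The sketch below applies the default patterns listed above to a few names. It assumes the patterns are matched from the start of the name (`re.match`), which may differ from what `perfmon` actually does, and the names `vif_0_rx` and `vbd_xvda_read` are made up to fit the patterns.

```python
import re

# Default patterns as listed in this section.
DEFAULT_RRD_REGEX = {
    "cpu_usage": r"cpu[0-9]+",
    "network_usage": r"vif_[0-9]+_[rt]x",
    "disk_usage": r"vbd_(xvd|hd)[a-z]+_(read|write)",
}


def matching_sources(pattern, names):
    """Return the data source names that the pattern matches from the start."""
    regex = re.compile(pattern)
    return [name for name in names if regex.match(name)]


names = ["cpu0", "cpu_usage", "memory", "vif_0_rx", "vbd_xvda_read",
         "vbd_xvda_io_throughput_read"]

for variable, pattern in DEFAULT_RRD_REGEX.items():
    print(variable, matching_sources(pattern, names))

# A custom rrd_regex, as in the example above, selects the throughput metric.
print("custom", matching_sources(r"vbd_xvda_io_throughput_read", names))
```

Under these assumptions, the default `disk_usage` pattern does not select `vbd_xvda_io_throughput_read`, which is why the example above sets `rrd_regex` explicitly.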
- Here is the script that allows you to get data sources:
```python
#!/usr/bin/env python3

import argparse
import sys
import XenAPI


def pretty_print(data_sources):
    if not data_sources:
        print("No data sources.")
        return

    # Compute alignment for something nice
    max_label_len = max(len(data["name_label"]) for data in data_sources)

    for data in data_sources:
        label = data["name_label"]
        desc = data["name_description"]
        print(f"{label:<{max_label_len}} | {desc}")


def list_vm_data(session, uuid):
    vm_ref = session.xenapi.VM.get_by_uuid(uuid)
    data_sources = session.xenapi.VM.get_data_sources(vm_ref)
    print(f"\nList of data sources related to VM {uuid}")
    pretty_print(data_sources)


def list_host_data(session, uuid):
    host_ref = session.xenapi.host.get_by_uuid(uuid)
    data_sources = session.xenapi.host.get_data_sources(host_ref)
    print(f"\nList of data sources related to Host {uuid}")
    pretty_print(data_sources)


def list_sr_data(session, uuid):
    sr_ref = session.xenapi.SR.get_by_uuid(uuid)
    data_sources = session.xenapi.SR.get_data_sources(sr_ref)
    print(f"\nList of data sources related to SR {uuid}")
    pretty_print(data_sources)


def main():
    parser = argparse.ArgumentParser(
        description="List data sources related to VM, host or SR"
    )
    parser.add_argument("--vm", help="VM UUID")
    parser.add_argument("--host", help="Host UUID")
    parser.add_argument("--sr", help="SR UUID")

    args = parser.parse_args()

    # Connect to local XAPI: no identification required to access local socket
    session = XenAPI.xapi_local()

    try:
        session.xenapi.login_with_password("", "")
        if args.vm:
            list_vm_data(session, args.vm)
        if args.host:
            list_host_data(session, args.host)
        if args.sr:
            list_sr_data(session, args.sr)
    except XenAPI.Failure as e:
        print(f"XenAPI call failed: {e.details}")
        sys.exit(1)
    finally:
        session.xenapi.session.logout()


if __name__ == "__main__":
    main()
```
