Home
Welcome to the tower-nagios-integration wiki!
This repository will contain various documentation and scripts that help integrate Ansible Tower with Nagios.
Script to be used as Nagios event handler to trigger jobs in Ansible Tower.
Many people use event handlers in Nagios to fix problems preemptively, before anyone is alerted. There are some limitations on how those handlers can or should be deployed, and on what kind of actions they can execute. Using Ansible Tower to execute recovery tasks gives much more flexibility and provides better integration with the established automation environment. On top of that, statistics can easily be generated using Tower's internal capabilities.
This script runs on the Nagios server and uses tower-cli to trigger jobs in Ansible Tower. Since those jobs are standard Ansible playbooks running from within Ansible Tower, they can easily serve as a service self-healing method: they run the same playbooks your operations or DevOps team would already use to recover the service. On top of that, since those playbooks run outside the failed host, they can be used to reboot, re-provision or even auto-scale (provided your Ansible Tower has already been properly configured for those tasks).
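As a rough illustration of the idea above, the following sketch builds and runs a tower-cli job launch command from Python. It is not the actual tower_handler.py code: the helper names build_launch_command and launch_job are hypothetical, and the tower-cli options used (--job-template, --inventory, --extra-vars, --limit) are assumed to match your installed tower-cli version.

```python
import json
import subprocess


def build_launch_command(template, inventory, extra_vars=None, limit=None):
    """Return the argv list for a `tower-cli job launch` call.

    Hypothetical helper for illustration; not taken from tower_handler.py.
    """
    cmd = [
        "tower-cli", "job", "launch",
        "--job-template", str(template),
        "--inventory", str(inventory),
    ]
    if extra_vars:
        # Extra variables are passed to tower-cli as a JSON string.
        cmd += ["--extra-vars", json.dumps(extra_vars)]
    if limit:
        # Restrict the run to a group name or comma-separated hosts.
        cmd += ["--limit", limit]
    return cmd


def launch_job(template, inventory, extra_vars=None, limit=None):
    """Run tower-cli and return its exit code (0 is assumed to mean success)."""
    return subprocess.call(
        build_launch_command(template, inventory, extra_vars, limit)
    )
```

Passing the command as a list rather than a shell string avoids quoting problems with template or inventory names that contain spaces, such as "My Template".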
Red Hat IT developed this script to reduce the burden on the operations team by automatically fixing problems without human intervention and shortening the time to recover.
This is not a silver bullet; it will not solve all your problems. It is merely a tool to help you automate your event management and service recovery.
- Python 2.7
- Nagios 3.5 or higher
- Ansible Tower 3.2 or higher
By the time you arrive here, you may already have everything you need to run this script. The requirements are listed here, but this document does not explain how to set them up. Please refer to the documentation of each technology involved.
- Ansible Tower
- Username/password to be used by Nagios.
- At least one inventory and one job template.
- It's highly advisable that your job template have the inventory "prompt on launch" check box marked, however it's not required.
- Nagios
- tower-cli installed and configured with the proper credentials.
- HINT: On RHEL7 you can install python2-ansible-tower-cli from EPEL
Copy tower_handler.py into the directory where your event handler scripts should run (as defined by your configuration).
First of all, make sure tower-cli is working properly. The minimum viable test is this:
# tower-cli job list
===== ============ ======================== ========== =======
id job_template created status elapsed
===== ============ ======================== ========== =======
1 1 2018-10-03T18:30:00.000Z successful 42.000
===== ============ ======================== ========== =======
To confirm that the handler itself is working, you can trigger a job from the command line:
# /path/to/tower_handler.py --template <my_template> --inventory <my_inventory> --attempt 2
If successful, the script will not produce any output, but you will see a job on your Ansible Tower Jobs tab (or in the job list, if you repeat the command above).
Although this script was written to be used as a Nagios event handler, it can also be used from the command line (though that is a little more complicated than using tower-cli directly).
It's important to know all the available command line options, because you will need them to define your own Nagios handlers. How you use those options determines how easy or hard the handler is to consume.
# /path/to/tower_handler.py --help
usage: tower_handler.py [-h] --template TEMPLATE --inventory INVENTORY
[--playbook PLAYBOOK] [--extra_vars EXTRA_VARS]
[--limit LIMIT] [--state STATE] [--attempt ATTEMPT]
[--downtime DOWNTIME] [--host_downtime DOWNTIME]
[--service SERVICE]
[--hostname HOSTNAME] [--warning]
optional arguments:
-h, --help show this help message and exit
--template TEMPLATE Job template (number or name)
--inventory INVENTORY
Inventory (number or name)
--playbook PLAYBOOK Playbook to run (yaml file inside template)
--extra_vars EXTRA_VARS
Extra variables (JSON)
--limit LIMIT Limit run to these hosts (group name, or comma
separated hosts)
--state STATE Nagios check state
--attempt ATTEMPT Nagios check attempt
--downtime DOWNTIME Nagios service downtime check
--host_downtime DOWNTIME Nagios host downtime check
--service SERVICE Nagios alerting service
--hostname HOSTNAME Nagios alerting hostname
--warning Trigger on WARNING (otherwise just CRITICAL and
UNKNOWN)
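For readers who want to see how this interface could be declared, here is a minimal argparse sketch that mirrors the help output above. It is reconstructed from the help text only, so the real tower_handler.py may define its options differently; build_parser is a hypothetical name, and all values are kept as strings because that is how Nagios macros arrive on the command line.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented options (illustrative sketch)."""
    parser = argparse.ArgumentParser(prog="tower_handler.py")
    parser.add_argument("--template", required=True,
                        help="Job template (number or name)")
    parser.add_argument("--inventory", required=True,
                        help="Inventory (number or name)")
    parser.add_argument("--playbook",
                        help="Playbook to run (yaml file inside template)")
    parser.add_argument("--extra_vars", help="Extra variables (JSON)")
    parser.add_argument("--limit",
                        help="Limit run to these hosts (group name, or "
                             "comma separated hosts)")
    parser.add_argument("--state", help="Nagios check state")
    parser.add_argument("--attempt", help="Nagios check attempt")
    parser.add_argument("--downtime", help="Nagios service downtime check")
    parser.add_argument("--host_downtime", metavar="DOWNTIME",
                        help="Nagios host downtime check")
    parser.add_argument("--service", help="Nagios alerting service")
    parser.add_argument("--hostname", help="Nagios alerting hostname")
    parser.add_argument("--warning", action="store_true",
                        help="Trigger on WARNING (otherwise just CRITICAL "
                             "and UNKNOWN)")
    return parser


if __name__ == "__main__":
    # With no arguments, parse_args() reads the real command line.
    print(build_parser().parse_args())
```

Only --template and --inventory are required, which matches the usage line; every other option defaults to None (or False for --warning) when omitted.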
There are many ways to configure Nagios to use this script. Here are some suggestions.
This will trigger the job run against all the hosts on the specified inventory.
/etc/nagios/conf.d/eventhandlers.cfg
define command {
command_name tower-handler-min
# when playbook does not require extra_vars, and you want to run on full inventory
command_line $HANDLERS$/tower_handler.py --state '$SERVICESTATE$' --attempt '$SERVICEATTEMPT$' --downtime '$SERVICEDOWNTIME$' --host_downtime '$HOSTDOWNTIME$' --service '$SERVICEDESC$' --hostname '$HOSTADDRESS$' --template '$ARG1$' --inventory '$ARG2$'
}
/etc/nagios/hosts.d/server01.example.com.cfg
define service {
use generic-service
host_name server01.example.com
service_description MyAppService
contact_groups it-production
check_command check_myappservice
event_handler tower-handler-min!My Template!My Inventory
}
This allows the use of all parameters during the handler call, which provides more information to the job template, allowing for more precise action.
/etc/nagios/conf.d/eventhandlers.cfg
define command {
command_name tower-handler-full
command_line $HANDLERS$/tower_handler.py --state '$SERVICESTATE$' --attempt '$SERVICEATTEMPT$' --downtime '$SERVICEDOWNTIME$' --host_downtime '$HOSTDOWNTIME$' --service '$SERVICEDESC$' --hostname '$HOSTADDRESS$' --template '$ARG1$' --inventory '$ARG2$' --extra_vars '$ARG3$' --limit '$ARG4$'
}
/etc/nagios/hosts.d/server01.example.com.cfg
define service {
use generic-service
host_name server01.example.com
service_description MyAppService
contact_groups it-production
check_command check_myappservice
event_handler tower-handler-full!My Template!My Inventory!my_variable: value!<fqdn>
}
Note: in this case, <fqdn> can be either the host itself, or a totally different host, as long as it exists in the inventory.
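To make the relationship between the event_handler line and the command definition more concrete, the sketch below imitates how the "!"-separated arguments end up in $ARG1$, $ARG2$ and so on. This is a simplified illustration, not Nagios code: expand_event_handler is a hypothetical function, escaping of "!" is ignored, and macros without a known value are left untouched.

```python
import re


def expand_event_handler(event_handler, command_line, macros):
    """Return (command_name, expanded_command_line).

    Simplified imitation of Nagios macro expansion, for illustration only.
    """
    parts = event_handler.split("!")
    command_name, args = parts[0], parts[1:]

    values = dict(macros)
    for index, value in enumerate(args, start=1):
        # The first argument after the command name becomes $ARG1$, etc.
        values["ARG%d" % index] = value

    def substitute(match):
        # Leave macros we have no value for exactly as written.
        return values.get(match.group(1), match.group(0))

    return command_name, re.sub(r"\$([A-Z0-9_]+)\$", substitute, command_line)


if __name__ == "__main__":
    name, line = expand_event_handler(
        "tower-handler-min!My Template!My Inventory",
        "tower_handler.py --state '$SERVICESTATE$' "
        "--template '$ARG1$' --inventory '$ARG2$'",
        {"SERVICESTATE": "CRITICAL"},
    )
    print(name)
    print(line)
```

The single quotes around each macro in the command definitions matter for the same reason shown here: values such as "My Template" contain spaces and would otherwise be split into separate arguments by the shell.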
- Run against the host itself: Adding --limit '$HOSTADDRESS$' to the command definition makes the job run only against the host that called the handler.
- Run in WARNING state: By default, the script only runs when the alert is in CRITICAL or UNKNOWN state. Adding --warning to the command definition allows it to trigger during a WARNING state.
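The state rule described above can be sketched as a small decision function. The CRITICAL/UNKNOWN/WARNING behaviour follows the text; the downtime handling is an assumption (a downtime depth greater than zero is taken to mean the host or service is in scheduled downtime, so the handler does nothing), and should_trigger is a hypothetical name rather than a function from tower_handler.py.

```python
def should_trigger(state, warning=False, downtime=0, host_downtime=0):
    """Decide whether a Tower job should be launched for this alert.

    Illustrative sketch; the real script may apply additional checks,
    for example on the check attempt number.
    """
    # Assumption: a non-zero downtime depth means scheduled downtime,
    # so no recovery job is launched.
    if int(downtime) > 0 or int(host_downtime) > 0:
        return False

    triggering_states = {"CRITICAL", "UNKNOWN"}
    if warning:
        # --warning extends the default set with the WARNING state.
        triggering_states.add("WARNING")

    return state.upper() in triggering_states


if __name__ == "__main__":
    print(should_trigger("CRITICAL"))
    print(should_trigger("WARNING"))
    print(should_trigger("WARNING", warning=True))
```

The downtime values are converted with int() because Nagios passes $SERVICEDOWNTIME$ and $HOSTDOWNTIME$ to the handler as strings.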