A set of scripts to run basic checks on an OpenShift cluster. PRs welcome!
⚠️ This is an unofficial tool, don't blame us if it breaks your cluster
$ ./openshift-checks.sh -h
Usage: openshift-checks.sh [-h]
This script will run a minimum set of checks to an OpenShift cluster
Available options:
-h, --help Print this help and exit
-v, --verbose Print script debug info
-l, --list Lists the available checks
-s <script>, --single <script> Executes only the provided script
--no-info Disable cluster info commands (default: enabled)
--no-checks Disable cluster check commands (default: enabled)
--prechecks path/to/install-config.yaml Executes only prechecks (default: disabled)
With no options, it will run all checks and info commands with no debug info
An automated container build of this repository's main branch is available at quay.io/rhsysdeseng/openshift-checks.
You can use it with your own kubeconfig file and the required parameters:
$ podman run -it --rm -v /home/foobar/kubeconfig:/kubeconfig:Z -e KUBECONFIG=/kubeconfig quay.io/rhsysdeseng/openshift-checks:latest -h
You can even create a handy alias:
$ alias openshift-checks="podman run -it --rm -v /home/foobar/kubeconfig:/kubeconfig:Z -e KUBECONFIG=/kubeconfig quay.io/rhsysdeseng/openshift-checks:latest"
Then, simply run it as:
$ openshift-checks -s info/00-clusterversion
Using default/api-foobar-example-com:6443/system:admin context
...
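If the kubeconfig path changes from cluster to cluster, a shell function can take the place of the alias. The following is a minimal sketch, not part of the repository: the function name openshift_checks and the OPENSHIFT_CHECKS_IMAGE variable are hypothetical, while the podman flags are the same ones used in the alias.

```shell
# Hypothetical helper: same podman invocation as the alias, but the
# kubeconfig path is taken from $KUBECONFIG (falling back to
# ~/.kube/config) and the image can be swapped via OPENSHIFT_CHECKS_IMAGE.
# The kubeconfig path should be absolute so podman treats it as a bind mount.
openshift_checks() {
  local kubeconfig="${KUBECONFIG:-$HOME/.kube/config}"
  local image="${OPENSHIFT_CHECKS_IMAGE:-quay.io/rhsysdeseng/openshift-checks:latest}"

  if [ ! -r "$kubeconfig" ]; then
    echo "kubeconfig not readable: $kubeconfig" >&2
    return 1
  fi

  # Everything after the image name is passed to openshift-checks.sh.
  podman run -it --rm \
    -v "${kubeconfig}:/kubeconfig:Z" \
    -e KUBECONFIG=/kubeconfig \
    "$image" "$@"
}
```

With the function loaded in the current shell, KUBECONFIG=/home/foobar/kubeconfig openshift_checks -l would list the available checks for that cluster.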
You can build your own container with the included Containerfile:
$ podman build --tag foobar/openshiftchecks .
STEP 1: FROM registry.access.redhat.com/ubi8/ubi:latest
...
$ podman push foobar/openshiftchecks
...
Then, run it by replacing quay.io/rhsysdeseng/openshift-checks:latest with your own image, such as foobar/openshiftchecks:latest:
$ podman run -it --rm -v /home/foobar/kubeconfig:/kubeconfig:Z -e KUBECONFIG=/kubeconfig foobar/openshiftchecks:latest -h
Usage: openshift-checks.sh [-h]
...
The checks can be scheduled to run periodically in an OpenShift cluster by creating a CronJob.
Check the cronjob.yaml example.
The openshift-checks.sh script is just a wrapper around bash scripts located in the info or checks folders.
Scripts in the checks folder:
Script | Description |
---|---|
alertmanager | Checks if there are warning or error alerts firing |
chronyc | Checks if the worker clocks are synced using chronyc |
clusterversion_errors | Checks if there are clusterversion errors |
csr | Checks if there are pending csr |
ctrlnodes | Checks if any controller nodes have had the NoSchedule taint removed |
entropy | Checks if the workers have enough entropy |
iptables-22623-22624 | Checks if the nodes' iptables rules are blocking 22623/tcp or 22624/tcp |
mcp | Checks if there are degraded mcp |
nodes | Checks if there are not ready or not schedulable nodes |
notrunningpods | Checks if there are not running pods |
operators | Checks if there are operators in 'bad' state |
restarts | Checks if there are pods restarted > n times (10 by default) |
terminating | Checks if there are pods terminating |
Scripts in the info folder:
Script | Description |
---|---|
clusterversion | Show the clusterversion |
clusteroperators | Show the clusteroperators |
nodes | Show the nodes status |
pods | Show the pods running in the cluster |
biosversion | Show the nodes' BIOS version |
ethtool-firmware-version | Show the nodes' NIC firmware version using ethtool |
mellanox-firmware-version | Show the nodes' Mellanox ConnectX-4 firmware version |
intel-firmware-version | Reports firmware versions of supported Intel cards if any are found |
mtu | Show the nodes' MTU for some interfaces |
node-versions | Show node components versions such as kubelet, crio, kernel, etc. |
ovs-hostnames | Show the ovs database chassis hostnames |
Prechecks, executed with the --prechecks option:
Script | Description |
---|---|
install-config-valid-yaml | Checks if the install-config.yaml file is a valid yaml file |
Environment variable | Default value | Description |
---|---|---|
OCDEBUGIMAGE | registry.redhat.io/rhel8/support-tools:latest | Used by oc debug. |
OSETOOLSIMAGE | registry.redhat.io/openshift4/ose-tools-rhel8:latest | Used by oc debug in ethtool-firmware-version |
RESTART_THRESHOLD | 10 | Used by the restarts script. |
INTEL_IDS | 8086:158b | Intel device IDs to check for firmware. Can be overridden for non-supported NICs. |
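As an illustration of overriding one of these defaults, the sketch below runs the checks with a different restart threshold. It assumes the scripts read the variables from the environment and fall back to the defaults in the table only when they are unset. The OPENSHIFT_CHECKS variable and the function name are hypothetical.

```shell
# Hypothetical variable holding the path to the wrapper script.
OPENSHIFT_CHECKS="${OPENSHIFT_CHECKS:-./openshift-checks.sh}"

# Run only the checks (no info commands) with a custom restart threshold.
# The first argument is the threshold; the rest is passed through unchanged.
run_with_restart_threshold() {
  local threshold="${1:-}"
  if [ "$#" -gt 0 ]; then
    shift
  fi
  case "$threshold" in
    '' | *[!0-9]*)
      echo "threshold must be a non-negative integer" >&2
      return 2
      ;;
  esac
  # Assumption: the restarts script picks RESTART_THRESHOLD up from the
  # environment instead of using its default of 10.
  RESTART_THRESHOLD="$threshold" "$OPENSHIFT_CHECKS" --no-info "$@"
}
```

For example, run_with_restart_threshold 20 -v flags only pods restarted more than 20 times. When using the container image, the same variable can be handed to the container with podman's -e option, for instance -e RESTART_THRESHOLD=20.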
The current intel-firmware-version and mellanox-firmware-version scripts only report the firmware version of the NICs supported by the SR-IOV operator (in 4.6).
You can add your own device ID if needed by modifying the script (hint: the variable is called IDS and the format is vendorID_A:deviceID_A vendorID_B:deviceID_B).
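To make the ID format concrete, here is a minimal sketch of how a space-separated list of vendorID:deviceID pairs can be split and validated in bash. It is not the code used by the firmware scripts; the split_ids function is hypothetical, and 15b3:1015 is a second entry added purely for illustration next to the default 8086:158b.

```shell
# Example list in the documented format.
IDS="8086:158b 15b3:1015"

# Print "vendor device" for each entry and stop at the first malformed one.
# PCI vendor and device IDs are four hexadecimal digits each.
split_ids() {
  local id
  local -a ids
  read -r -a ids <<< "$1"
  for id in "${ids[@]}"; do
    if [[ "$id" =~ ^([0-9a-fA-F]{4}):([0-9a-fA-F]{4})$ ]]; then
      echo "${BASH_REMATCH[1]} ${BASH_REMATCH[2]}"
    else
      echo "malformed ID: $id" >&2
      return 1
    fi
  done
}

split_ids "$IDS"
# prints:
# 8086 158b
# 15b3 1015
```

Each pair can then be used to look up matching devices on a node, for instance with lspci -d vendorID:deviceID.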
To contribute, add a new script to the proper folder (info to gather information, checks to perform a check) and create a pull request.
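As a starting point, a new check could look like the sketch below. It is a hypothetical example that reports PersistentVolumeClaims that are not Bound, using only oc get. The helper functions and conventions of the existing scripts are not reproduced here, so compare it with a script already in the checks folder before opening a pull request.

```shell
#!/usr/bin/env bash
# Hypothetical check: report PersistentVolumeClaims that are not Bound.

# Print "namespace/name status" for every PVC whose STATUS is not Bound.
# With --all-namespaces and --no-headers, the first three columns are
# NAMESPACE, NAME and STATUS.
unbound_pvcs() {
  local pvcs
  # Return 2 if oc itself fails (not logged in, API unreachable, ...).
  pvcs="$(oc get pvc --all-namespaces --no-headers)" || return 2
  awk 'NF >= 3 && $3 != "Bound" { print $1 "/" $2, $3 }' <<< "$pvcs"
}

# Return 0 when every PVC is Bound, 1 when some are not, 2 on oc errors.
check_pvcs() {
  local unbound
  unbound="$(unbound_pvcs)" || return 2
  if [ -n "$unbound" ]; then
    echo "PVCs not bound:"
    echo "$unbound"
    return 1
  fi
  echo "All PVCs are bound"
}
```

A standalone script would end with a call to check_pvcs. How the wrapper collects results from each script is not covered by this sketch.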
You can send the script's output to mail so that an email is sent if there are any errors.
First, configure postfix (already included in RHEL 8) as a relay host (see https://access.redhat.com/solutions/217503). As an example:
- Append the following settings to /etc/postfix/main.cf:
myhostname = kni1-bootstrap.example.com
relayhost = smtp.example.com
- Restart the postfix service:
sudo systemctl restart postfix
- Test it:
echo "Hola" | mail -s 'Subject' johndoe@example.com
Then, run the script as:
./openshift-checks.sh > /tmp/oc-errors 2>&1 || mail -s "Something has failed" johndoe@example.com < /tmp/oc-errors
As a bonus you can include this in a cronjob for periodic checks.
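For periodic runs, the one-liner can be wrapped in a small script that cron calls. The following is a sketch under a few assumptions: CHECKS_CMD, MAIL_TO and run_and_mail are illustrative names that do not exist in the repository, and the checks are assumed to exit with a non-zero status when something fails, as the one-liner implies.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: write the output to a private temporary file
# instead of a fixed path in /tmp, and mail it only on failure.
CHECKS_CMD="${CHECKS_CMD:-./openshift-checks.sh}"
MAIL_TO="${MAIL_TO:-johndoe@example.com}"

run_and_mail() {
  local log rc=0
  log="$(mktemp)" || return 1
  # Capture stdout and stderr; remember the exit status of the checks.
  "$CHECKS_CMD" "$@" > "$log" 2>&1 || rc=$?
  if [ "$rc" -ne 0 ]; then
    mail -s "Something has failed" "$MAIL_TO" < "$log"
  fi
  rm -f "$log"
  return "$rc"
}
```

A script that defines this function and ends with run_and_mail can be scheduled from a crontab entry. Keep in mind that cron starts jobs with a minimal environment, so KUBECONFIG and an absolute path in CHECKS_CMD usually have to be set explicitly.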