Uyuni Health Check Tool Disconnected Solution #9322

Draft
wants to merge 22 commits into
base: master
12 changes: 12 additions & 0 deletions health-check/.gitignore
@@ -0,0 +1,12 @@
build
dist
.eggs
*.egg-info
logcli-linux-amd64
promtail-linux-amd64
__pycache__
**/config/exporter/config.yaml
**/config/promtail/config.yaml
**/config/grafana/dashboards/supportconfig_with_logs.json

.vscode/
41 changes: 41 additions & 0 deletions health-check/README.md
@@ -0,0 +1,41 @@
# uyuni-health-check

A tool that provides dashboards, metrics, and logs from an Uyuni server supportconfig to visualise the server's health status.

## Requirements

* `python3`
* `podman`

## Building and installing

Install the tool locally into a virtual environment:

```
python3 -m venv venv
. venv/bin/activate
pip install .
```

## Getting started

This tool builds and deploys the containers needed to scrape metrics and logs from an Uyuni server supportconfig directory.
Run the `run` phase of the tool as follows:

```
uyuni-health-check -s ~/path/to/supportconfig run --logs --from_datetime=2024-01-01T00:00:00Z --to_datetime=2024-06-01T20:00:00Z
```

This will create and start the following containers locally:

- uyuni-health-exporter (port `9000`)
- grafana (port `3000`)
- loki (port `9100`)
- promtail (port `9081`)

After you start the containers, visit `localhost:3000` and select the `Supportconfig with Logs` dashboard.
If Grafana asks you to log in, the default username/password is `admin:admin`.
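
To confirm that the containers are actually listening before opening the dashboard, a small probe along these lines may help. It is only a sketch: the service names and ports are copied from the list above, while `SERVICES`, `is_listening` and `check_services` are hypothetical helpers that are not part of the tool.

```python
import socket

# Ports as listed in this README. config.ini in this change sets
# loki_port = 3100, so the Loki entry may need adjusting for your setup.
SERVICES = {
    "uyuni-health-exporter": 9000,
    "grafana": 3000,
    "loki": 9100,
    "promtail": 9081,
}


def is_listening(port, host="127.0.0.1", timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_services(services=SERVICES, host="127.0.0.1"):
    """Map each service name to whether its port accepts connections."""
    return {name: is_listening(port, host) for name, port in services.items()}


if __name__ == "__main__":
    for name, up in check_services().items():
        print(f"{name}: {'up' if up else 'not reachable'}")
```

A port that accepts connections only shows that something is bound to it; it does not prove that the service behind it is healthy.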

## Security notes

After you run this tool, and until the containers are destroyed, the Grafana dashboards and the other exposed endpoints serve metrics and log messages that may contain sensitive information. They are reachable by any non-root user on the system and by anyone who has network access to this host.

43 changes: 43 additions & 0 deletions health-check/pyproject.toml
@@ -0,0 +1,43 @@
# SPDX-FileCopyrightText: 2023 SUSE LLC
#
# SPDX-License-Identifier: Apache-2.0

[project]
name = "uyuni-health-check"
description = "Show Uyuni server health metrics and logs"
readme = "README.md"
requires-python = ">=3.6"
classifiers = [
    "Programming Language :: Python :: 3",
    "Operating System :: OS Independent",
]
dependencies = [
    "Click",
    "rich",
    "requests",
    "Jinja2",
    "PyYAML",
]
maintainers = [
    {name = "Pablo Suárez Hernández", email = "psuarezhernandez@suse.com"},
]
dynamic = ["version"]

[project.urls]
homepage = "https://github.com/uyuni-project/uyuni"
tracker = "https://github.com/uyuni-project/uyuni/issues"

[project.scripts]
uyuni-health-check = "uyuni_health_check.main:main"

[tool.setuptools]
package-dir = {"" = "src"}

[build-system]
requires = [
    "setuptools>=42",
    "setuptools_scm[toml]",
    "wheel",
]
build-backend = "setuptools.build_meta"

Empty file.
11 changes: 11 additions & 0 deletions health-check/src/uyuni_health_check/config.ini
@@ -0,0 +1,11 @@
[podman]
network_name = health-check-network

[loki]
loki_container_name = uyuni_health_check_loki
loki_port = 3100
jobs = cobbler,postgresql,rhn,apache

[logcli]
logcli_container_name = uyuni_health_check_logcli
logcli_image_name = logcli
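
For reference, this file can be read with the standard-library `configparser`. The sketch below is not taken from the change: `load_settings` and the keys of the returned dictionary are hypothetical, and the way the tool itself loads `config.ini` is not shown in this diff. It mainly illustrates that `loki_port` has to be converted to an integer and that `jobs` is a comma-separated string that needs splitting.

```python
import configparser

# Same content as the config.ini added in this change.
CONFIG_TEXT = """
[podman]
network_name = health-check-network

[loki]
loki_container_name = uyuni_health_check_loki
loki_port = 3100
jobs = cobbler,postgresql,rhn,apache

[logcli]
logcli_container_name = uyuni_health_check_logcli
logcli_image_name = logcli
"""


def load_settings(text):
    """Parse the INI text and return a few typed values (hypothetical helper)."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    raw_jobs = parser.get("loki", "jobs")
    return {
        "network": parser.get("podman", "network_name"),
        "loki_port": parser.getint("loki", "loki_port"),
        # `jobs` is a comma-separated string; split it into a clean list.
        "jobs": [job.strip() for job in raw_jobs.split(",") if job.strip()],
    }


if __name__ == "__main__":
    print(load_settings(CONFIG_TEXT))
```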
199 changes: 199 additions & 0 deletions health-check/src/uyuni_health_check/config/grafana/alerts.yaml
@@ -0,0 +1,199 @@
apiVersion: 1
groups:
  - orgId: 1
    name: alert-eval
    folder: alerts
    interval: 1m
    rules:
      - uid: be6llqu083474c
        title: Likely Salt Performance issues - SaltReqTimeout
        condition: B
        data:
          - refId: A
            queryType: range
            relativeTimeRange:
              from: 2592000
              to: 0
            datasourceUid: P8E80F9AEF21F6940
            model:
              direction: backward
              editorMode: builder
              expr: sum(count_over_time({job="salt"} |~ `(?i)SaltReqTimeoutError` [$__auto]))
              intervalMs: 1000
              legendFormat: ""
              maxDataPoints: 43200
              queryType: range
              refId: A
              step: ""
          - refId: B
            datasourceUid: __expr__
            model:
              conditions:
                - evaluator:
                    params:
                      - 10
                      - 0
                    type: gt
                  operator:
                    type: and
                  query:
                    params: []
                  reducer:
                    params: []
                    type: avg
                  type: query
              datasource:
                name: Expression
                type: __expr__
                uid: __expr__
              expression: A
              intervalMs: 1000
              maxDataPoints: 43200
              refId: B
              type: threshold
        noDataState: OK
        execErrState: Error
        for: 1m
        annotations:
          summary: We detected more than 10 `SaltReqTimeout` errors in the logs in the past 1 month. This is likely indicative of Salt performance issues.
        labels:
          component: salt
          issue: performance_issue
        isPaused: false
      - uid: ee6lmldzgwf0gd
        title: Issues that potentially degrade Salt performance
        condition: B
        data:
          - refId: A
            queryType: instant
            relativeTimeRange:
              from: 2592000
              to: 0
            datasourceUid: P8E80F9AEF21F6940
            model:
              editorMode: builder
              expr: sum(count_over_time({job="salt"} |~ `(?i)an extra return was detected|the public keys did not match|Event with bad payload received|Received minion error from` [$__auto]))
              intervalMs: 1000
              maxDataPoints: 43200
              queryType: instant
              refId: A
          - refId: B
            datasourceUid: __expr__
            model:
              conditions:
                - evaluator:
                    params:
                      - 150
                      - 0
                    type: gt
                  operator:
                    type: and
                  query:
                    params: []
                  reducer:
                    params: []
                    type: avg
                  type: query
              datasource:
                name: Expression
                type: __expr__
                uid: __expr__
              expression: A
              intervalMs: 1000
              maxDataPoints: 43200
              refId: B
              type: threshold
        noDataState: NoData
        execErrState: Error
        for: 1m
        annotations:
          summary: "We found more than 150 of \"an extra return was detected\", \"the public keys did not match\", \"Event with bad payload received\", or \"Received minion error from\" messages in the logs over the past month. \n\nThese issues might be decreasing Salt performance."
        labels:
          component: salt
          issue: performance_issue
        isPaused: false
      - uid: ce6i8dhdhj400e
        title: More worker threads than CPUs
        condition: C
        data:
          - refId: A
            relativeTimeRange:
              from: 600
              to: 0
            datasourceUid: infinity
            model:
              columns: []
              datasource:
                type: yesoreyeram-infinity-datasource
                uid: infinity
              filters: []
              format: table
              global_query_id: ""
              hide: false
              intervalMs: 1000
              maxDataPoints: 43200
              parser: backend
              refId: A
              root_selector: salt_configuration[name="worker_threads"].value
              source: url
              type: json
              url: http://uyuni_health_check_supportconfig-exporter:9000/metrics.json
              url_options:
                data: ""
                method: GET
          - refId: B
            relativeTimeRange:
              from: 600
              to: 0
            datasourceUid: infinity
            model:
              columns: []
              datasource:
                type: yesoreyeram-infinity-datasource
                uid: infinity
              filters: []
              format: table
              global_query_id: ""
              hide: false
              intervalMs: 1000
              maxDataPoints: 43200
              parser: backend
              refId: B
              root_selector: hw[name="cpu_count"].value
              source: url
              type: json
              url: http://uyuni_health_check_supportconfig-exporter:9000/metrics.json
              url_options:
                data: ""
                method: GET
          - refId: C
            datasourceUid: __expr__
            model:
              conditions:
                - evaluator:
                    params:
                      - 0
                      - 0
                    type: gt
                  operator:
                    type: and
                  query:
                    params: []
                  reducer:
                    params: []
                    type: avg
                  type: query
              datasource:
                name: Expression
                type: __expr__
                uid: __expr__
              expression: $B - $A < 0
              hide: false
              intervalMs: 1000
              maxDataPoints: 43200
              refId: C
              type: math
        noDataState: NoData
        execErrState: Error
        for: 1m
        isPaused: false
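
The last rule, "More worker threads than CPUs", reads `worker_threads` as query A and `cpu_count` as query B, then evaluates the math expression `$B - $A < 0`. A rough Python equivalent of that check is sketched below. The payload layout is an assumption inferred from the two `root_selector` values (the real `/metrics.json` output is not part of this diff), and `select_value` and `more_workers_than_cpus` are hypothetical names.

```python
def select_value(metrics, section, name):
    """Return the `value` of the entry called `name` in `metrics[section]`."""
    for entry in metrics.get(section, []):
        if entry.get("name") == name:
            return entry["value"]
    raise KeyError(f"{section}[name={name!r}] not found")


def more_workers_than_cpus(metrics):
    """Mirror the rule's math expression `$B - $A < 0`."""
    worker_threads = select_value(metrics, "salt_configuration", "worker_threads")  # A
    cpu_count = select_value(metrics, "hw", "cpu_count")  # B
    return cpu_count - worker_threads < 0


if __name__ == "__main__":
    # Hypothetical payload; the real exporter output may be shaped differently.
    sample = {
        "salt_configuration": [{"name": "worker_threads", "value": 8}],
        "hw": [{"name": "cpu_count", "value": 4}],
    }
    print(more_workers_than_cpus(sample))
```

With this reading, the condition is true only when `worker_threads` is strictly greater than `cpu_count`; equal values do not trigger it.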
16 changes: 16 additions & 0 deletions health-check/src/uyuni_health_check/config/grafana/dashboard.yaml
@@ -0,0 +1,16 @@
# SPDX-FileCopyrightText: 2023 SUSE LLC
#
# SPDX-License-Identifier: Apache-2.0

apiVersion: 1

providers:
  - name: "Dashboard provider"
    orgId: 1
    type: file
    disableDeletion: false
    updateIntervalSeconds: 10
    allowUiUpdates: true
    options:
      path: /var/lib/grafana/dashboards
      foldersFromFilesStructure: true