Commit

Merge pull request #9 from lsst-sqre/tickets/DM-32532C
DM-35232C: add Roundtable disk check
athornton authored Jun 22, 2022
2 parents 18a20f8 + e89a00f commit 8c5b3f8
Showing 3 changed files with 76 additions and 14 deletions.
35 changes: 25 additions & 10 deletions README.md
@@ -1,5 +1,16 @@
 # rubin-influx-tools

+## Bucket and task nomenclature
+
+Our `monitoring` InfluxDB v2 instance assumes that any bucket whose name
+neither starts with nor ends with an underscore represents a Kubernetes
+application bucket. Bucket names ending with an underscore are for
+measurements that do not pertain to a single Kubernetes application;
+some, like `multiapp_`, are used to collect measurements across multiple
+applications, while others, like `roundtable_internal_`, measure
+host-level resource usage from Roundtable itself rather than the
+satellite RSP instances being monitored.
+
 ## Bucketmapper

 ### Motivation
@@ -39,10 +50,10 @@ whose `main()` method, when supplied with an *admin* `INFLUXDB_TOKEN`
 will create a new token with sufficient permissions to create tasks for
 new application buckets it finds, but not full admin rights. This only
 has to be run once per InfluxDB v2 installation, but does need to
-precede running `restartmapper`, since the generated token is the one
-`restartmapper` should use.
+precede running `taskmaker`, since the generated token is the one
+`taskmaker` should use.

-## Restartmapper
+## Taskmaker

 ### Motivation

@@ -53,14 +64,18 @@ error-prone to do manually.

 ### Implementation

-[restartmapper](./src/rubin_influx_tools/restartmapper.py) is a Python 3 class
+[taskmaker](./src/rubin_influx_tools/taskmaker.py) is a Python 3 class
 whose `main()` function queries the buckets in an organization to find
-K8s applications, checks to see whether each bucket is matched to a task
-to watch that application for pod restarts, and creates any tasks it
-finds missing.
-
-Currently creation of the subsequent check and alert notification rules
-is manual.
+K8s applications, and then determines whether there's a bucket to collect
+cross-application results and creates it if necessary. Then it checks
+to see whether each application bucket is matched to a task to watch
+that application for pod restarts, and creates any tasks that are missing.
+Finally, it creates tasks to periodically check the cross-application
+bucket (called `multiapp_`) for entries indicating that it needs to send
+an alert to Slack.
+
+The Slack webhook URL is stored within InfluxDB v2 as a (manually created)
+secret.

 ## Configuration

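The bucket naming convention added to the README above can be sketched as a small classifier. This is only an illustration: `classify_bucket` and its return labels are hypothetical names, not part of `rubin_influx_tools`, and the handling of names with a leading underscore is an assumption, since the README text does not say what those buckets represent.

```python
def classify_bucket(name: str) -> str:
    """Classify a bucket name by the underscore convention in the README.

    Hypothetical helper for illustration; the labels are not from the repo.
    """
    if name.startswith("_"):
        # Assumption: the README does not describe leading-underscore
        # buckets, so they are reported separately rather than guessed at.
        return "unspecified"
    if name.endswith("_"):
        # Trailing underscore: not tied to a single Kubernetes application
        # (for example `multiapp_` or `roundtable_internal_`).
        return "non_application"
    # Neither leading nor trailing underscore: a Kubernetes application bucket.
    return "application"
```

For example, `classify_bucket("multiapp_")` yields `"non_application"`, while a name such as `"nublado"` (used here purely as a sample application name) yields `"application"`.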
18 changes: 14 additions & 4 deletions src/rubin_influx_tools/taskmaker.py
@@ -15,6 +15,7 @@
 slack_timing = {
     "restart": {"every": "1m", "offset": "30s"},
     "memory_check": {"every": "5m", "offset": "43s"},
+    "disk_check": {"every": "5m", "offset": "56s"},
 }


@@ -73,7 +74,7 @@ async def construct_tasks(self) -> None:
         extant_tasks = await self.list_tasks()
         self._extant_tnames = [x.name for x in extant_tasks]

-        for ttype in "restart", "memory_check":
+        for ttype in "restart", "memory_check", "disk_check":
             await self.construct_named_tasks(ttype)
             await self.construct_slack_task(ttype)
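
As a rough sketch of how the timing table and the task-type loop fit together, the snippet below pairs each task type with the Slack task name built by `construct_slack_task` (the `_slack_notify_{ttype}s` pattern in this diff). The `slack_timing` values are copied from the diff; `slack_task_name` and `slack_schedule` are hypothetical helpers introduced only for illustration.

```python
from typing import Dict, Tuple

# Values copied from the diff; each offset differs, so the three
# notification tasks do not all start at the same second.
slack_timing = {
    "restart": {"every": "1m", "offset": "30s"},
    "memory_check": {"every": "5m", "offset": "43s"},
    "disk_check": {"every": "5m", "offset": "56s"},
}


def slack_task_name(ttype: str) -> str:
    """Build the Slack task name using the pattern seen in the diff."""
    return f"_slack_notify_{ttype}s"


def slack_schedule() -> Dict[str, Tuple[str, str]]:
    """Map each Slack task name to its (every, offset) pair (hypothetical)."""
    return {
        slack_task_name(ttype): (timing["every"], timing["offset"])
        for ttype, timing in slack_timing.items()
    }
```

Under these assumptions, the new disk check appears as `_slack_notify_disk_checks`, scheduled every `5m` with a `56s` offset.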

@@ -96,26 +97,35 @@ async def construct_slack_task(self, ttype: str) -> None:
         tname = f"_slack_notify_{ttype}s"
         if tname in self._extant_tnames:
             return
-        task_text = get_template(ttype, template_marker="_slack.flux")
+        task_template = get_template(ttype, template_marker="_slack.flux")
         task = TaskPost(
             description=tname,
             org=self.org,
             orgID=self.org_id,
             status="active",
-            flux=task_text.render(offset=offset, every=every, taskname=tname),
+            flux=task_template.render(
+                offset=offset, every=every, taskname=tname
+            ),
         )
         payload = [asdict(task)]
         url = f"{self.api_url}/tasks"
         await self.post(url, payload)

     async def build_tasks(self, apps: List[str], ttype: str) -> List[TaskPost]:
         """Create a list of task objects to post."""
-        task_template = get_template(ttype)
+        try:
+            task_template = get_template(ttype, template_marker="_tmpl.flux")
+        except FileNotFoundError:
+            # This is OK for disk checks
+            self.log.warning(f"No application task template for '{ttype}'")
+            return []
         tasks = []
         offset = 0
         for app in apps:
             offset %= 60
             offsetstr = seconds_to_duration_literal(offset)
+            if offsetstr == "infinite":
+                offsetstr = "0s"
             taskname = f"{app.capitalize()} {ttype}s"
             tasks.append(
                 TaskPost(
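The per-application offset handling in `build_tasks` is cut off in this view, so the following is only a sketch of the visible idea: wrap a running offset at 60 seconds and never emit a non-numeric duration. The increment (`step`) is an assumption, because the diff does not show how `offset` advances, and `duration_literal` is a hypothetical stand-in for `seconds_to_duration_literal`, whose exact behavior is not shown here.

```python
from typing import Dict, List


def duration_literal(seconds: int) -> str:
    """Hypothetical stand-in for `seconds_to_duration_literal`.

    Formats whole seconds as a Flux-style duration and always returns a
    numeric literal, mirroring the diff's fallback to "0s".
    """
    if seconds <= 0:
        return "0s"
    return f"{seconds}s"


def staggered_offsets(apps: List[str], step: int = 7) -> Dict[str, str]:
    """Assign each application a task offset, wrapping at 60 seconds.

    `step` is an assumed increment; the original loop body is not visible.
    """
    offsets: Dict[str, str] = {}
    offset = 0
    for app in apps:
        offset %= 60  # same wrap-around as in the diff
        offsets[app] = duration_literal(offset)
        offset += step
    return offsets
```

With `step=25`, four applications would receive `0s`, `25s`, `50s`, and `15s`, which shows the wrap-around keeping every offset under one minute.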
37 changes: 37 additions & 0 deletions src/rubin_influx_tools/templates/disk_check_slack.flux
@@ -0,0 +1,37 @@
+import "slack"
+import "influxdata/influxdb/secrets"
+
+option v = {bucket: "_monitoring", timeRangeStart: -1h, timeRangeStop: now(), windowPeriod: 10000ms}
+
+option task = {name: "{{taskname}}", every: {{every}}, offset: {{offset}}}
+
+slackurl = secrets.get(key: "slack_notify_url")
+toSlack = slack.endpoint(url: slackurl)
+
+colorLevel = (v) => {
+    color =
+        if float(v: v) > 95.0 then
+            "danger"
+        else if float(v: v) >= 85.0 then
+            "warning"
+        else
+            "good"
+
+    return color
+}
+
+from(bucket: "roundtable_internal_")
+    |> range(start: -2m)
+    |> filter(fn: (r) => r["_measurement"] == "disk")
+    |> filter(fn: (r) => r["_field"] == "used_percent")
+    |> group(columns: ["_time"])
+    |> filter(fn: (r) => r._value > 85.0)
+    |> toSlack(
+        mapFn: (r) =>
+            ({
+                channel: "roundtable-test-notifications",
+                text: "${r.host}: ${r.path} at ${r._value}% used",
+                color: colorLevel(v: r._value),
+            }),
+    )()
+    |> yield()
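
To make the alert thresholds easier to reason about, here is a small Python mirror of the `colorLevel` function and the `> 85.0` filter from the Flux template above. The function names `color_level` and `should_alert` are hypothetical and exist only for this sketch; the thresholds themselves are taken from the template.

```python
def color_level(used_percent: float) -> str:
    """Mirror of the Flux `colorLevel` thresholds in the template."""
    if used_percent > 95.0:
        return "danger"
    if used_percent >= 85.0:
        return "warning"
    return "good"


def should_alert(used_percent: float) -> bool:
    """Mirror of the template's `r._value > 85.0` filter."""
    return used_percent > 85.0
```

One consequence visible from the thresholds: every row that passes the `> 85.0` filter maps to `"warning"` or `"danger"`, so the `"good"` branch is not reached by this particular pipeline, and a reading of exactly 85.0 would be classed as `"warning"` but is filtered out before any message is sent.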
