update machine-api-usage-telemetry #1072

Merged
15 changes: 9 additions & 6 deletions enhancements/machine-api/machine-api-usage-telemetry.md
@@ -259,11 +259,14 @@ will need to be exposed through telemetry. These metrics are:
* This should be exported through the `cluster:usage:resources:sum` series with
a resource type of `machinehealthchecks.machine.openshift.io`.
* MachineHealthCheck total nodes covered count
-  * `mapi_machinehealthcheck_nodes_covered` - This metric has no labels.
+  * `mapi_machinehealthcheck_nodes_covered` - This metric has two labels representing
+    the name and namespace of the machine health check.
* MachineHealthCheck successful remediations count
-  * `mapi_machinehealthcheck_remediation_success_total` - This metric has no labels.
+  * `mapi_machinehealthcheck_remediation_success_total` - This metric has two labels
+    representing the name and namespace of the machine health check.
* MachineHealthCheck short circuit state
-  * `mapi_machinehealthcheck_short_circuit` - This metric has no labels.
+  * `mapi_machinehealthcheck_short_circuit` - This metric has two labels representing
+    the name and namespace of the machine health check.

**Metric series to be exported**

@@ -275,8 +278,8 @@ will need to be exposed through telemetry. These metrics are:
with no labels.
* Total MachineAutoscaler resource count, using `cluster:usage:resources:sum{resource="machineautoscalers.autoscaling.openshift.io"}`.
* Total MachineHealthCheck resource count, using `cluster:usage:resources:sum{resource="machinehealthchecks.machine.openshift.io"}`.
-* Total nodes covered by MachineHealthChecks count, using `mapi_machinehealthcheck_nodes_covered` with no labels.
-* Total remediations completed by MachineHealthChecks count, using `mapi_machinehealthcheck_remediation_success_total` with no labels.
+* Total nodes covered by MachineHealthChecks count, using a sum of all `mapi_machinehealthcheck_nodes_covered` with no labels on the series.
+* Total remediations completed by MachineHealthChecks count, using a sum of all `mapi_machinehealthcheck_remediation_success_total` with no labels on the series.
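
Note for readers of this hunk: "a sum of all ... with no labels on the series" amounts to collapsing the per-MachineHealthCheck samples into one label-free total. The sketch below illustrates that aggregation in Python. The `Sample` type, the `sum_without_labels` helper, and the sample values are hypothetical and exist only for illustration; the diff does not show the actual rule, which would presumably be a PromQL aggregation along the lines of `sum(mapi_machinehealthcheck_nodes_covered)`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One labeled sample for a single MachineHealthCheck (hypothetical type)."""
    name: str        # value of the name label
    namespace: str   # value of the namespace label
    value: float


def sum_without_labels(samples):
    """Collapse per-resource samples into a single label-free total."""
    return sum(s.value for s in samples)


# Illustrative values only, not real cluster data.
nodes_covered = [
    Sample("workers-a", "openshift-machine-api", 3),
    Sample("workers-b", "openshift-machine-api", 5),
]

total_nodes_covered = sum_without_labels(nodes_covered)
print(total_nodes_covered)
```

The per-resource labels stay available on the cluster for debugging, while only the aggregated value would need to be exported.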

In addition to the metrics defined above, the alerts generated by the Machine
API components will be used to augment this data. The listings below are
@@ -292,7 +295,7 @@ above.
**These alerts will need to be created**

* machine-api-operator
-  * MachineWithOldDeletionTimestamp - This alert will fire when a Machine resource
+  * MachineNotYetDeleted - This alert will fire when a Machine resource
is detected that has deletion timestamp that is older than 6 hours.
* machine health check controller
* MachineHealthCheckUnterminatedShortCircuit - This alert will fire when the