3.1.3 Monitoring metrics and SLOs

robert-sanfeliu edited this page Oct 16, 2024 · 3 revisions

Introduction

The purpose of this wiki page is to guide NebulOuS adopters in defining the monitoring metrics to be used as part of their application. It is not intended to provide a complete implementation of the tasks that should be carried out by NebulOuS adopters, but rather an overview of these tasks. By following the instructions set out here, a NebulOuS adopter (from now on called a user) is able to make use of the monitoring and adaptation capabilities offered by the platform.

Illustrative scenario

In order to make this document more comprehensible, the KubeVela scenario [1] will be used as an example whenever this is deemed necessary. Briefly, this scenario covers an application which receives video streams through Kafka from relevant video sources, and then performs face detection on them.

[1] https://gitlab.ubitech.eu/nebulous/use-cases/surveillance-dsl-demo/-/tree/master

Monitoring metrics within reconfigurations

To benefit from the automated reconfiguration capabilities of NebulOuS, a user should first consider the circumstances under which the application should reconfigure (i.e., adapt) itself, and the type of adaptation that would be beneficial. This understanding should also be reflected in the estimation of the Utility function of the application, which guides its steady-state operation. While the user naturally starts by considering the impact of particular context parameters on the application as a whole, that impact must then be formulated on a per-component basis.

In our example, these considerations lead us to declare that the application performs well only when all its components perform well; this implies that, for each component, particular metrics should respect certain constraints. Taking one of these declarations (see the README.md file), the application owner states for the Face Detection component that "If CPU Util Percentage is greater than 80%, migrate, scale out or add more CPU cores". This statement pairs an SLO with the action to take once the SLO is triggered. The SLO itself consists of the metric to be monitored (CPU Util Percentage), the comparison operator (>) and the threshold (80%). In NebulOuS the action is not part of the SLO declaration; it is instead inferred by the solvers, based on the Utility function of the application.

Working top-down, once the circumstances leading to a reconfiguration have been ascertained (here, the inability of the hardware to process more images locally), the choice of the monitoring metric becomes easier. In the above-mentioned scenario, this implies that a metric called CPU Utilization is available.

Definition of Monitoring metrics

To effectively monitor and optimize applications, NebulOuS employs a robust system of metrics gathered through software probes. These metrics fall into three distinct categories:

Hardware-Level Metrics encompass the computing resources consumed by application components, such as CPU usage, RAM utilization, and disk activity. NebulOuS automatically collects and tracks these metrics, which are vital for establishing the objective function of any application. For a comprehensive list of these metrics, please refer to this link (TODO: add link).

Application-Level Metrics are intricately tied to the application's logic and functionality. For instance, a face detection application might want to monitor and publish the number of faces detected per frame. Similarly, an application providing a REST API might be interested in tracking and publishing the processing time for each request or the number of active sessions. NebulOuS offers application developers the libraries necessary for publishing these metrics (TODO: add link).

Data-Flow Metrics are captured automatically for applications that use the NebulOuS MQTT broker. These metrics pertain to the data flow within the system, including details such as the number of pending messages on a topic and the time it takes for an MQTT consumer to receive and acknowledge a message. The exact list of metrics collected can be found here.

Using monitoring metrics

(TBD: Describe how metrics are used in SLO/Utility function definition).

Templates

A template is a modelling utility which allows a NebulOuS user to define properties that are common across a class of metrics. For example, 'percentage'-based metrics all have a minimum value of 0, a maximum value of 100 and may have an int or a float type. A measurement unit can also be defined, although no conversion between measurement units happens within NebulOuS. Templates can be defined by completing the following fields:

image
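The 'percentage' example above can be sketched in a few lines of Python. This is only an illustration of the idea of a template as a set of shared bounds and a type; the class name `MetricTemplate`, its field names and the `accepts` helper are hypothetical and are not part of the NebulOuS user interface or API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MetricTemplate:
    """Hypothetical stand-in for a template: properties shared by a class of metrics."""
    name: str
    value_type: type          # int or float
    minimum: float
    maximum: float
    unit: Optional[str] = None  # a label only; no unit conversion is performed

    def accepts(self, value) -> bool:
        """Return True if the value has the right type and lies within the bounds."""
        if isinstance(value, bool):  # bool is a subclass of int; reject it explicitly
            return False
        if self.value_type is float:
            type_ok = isinstance(value, (int, float))  # assumption: ints are valid floats
        else:
            type_ok = isinstance(value, self.value_type)
        return type_ok and self.minimum <= value <= self.maximum


# A template for percentage-based metrics, as described in the text.
percentage = MetricTemplate(name="percentage", value_type=float,
                            minimum=0, maximum=100, unit="%")
```

With this sketch, `percentage.accepts(42.5)` is true, while `percentage.accepts(120)` is false because the value exceeds the template's maximum.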

(TBD: Document how templates can be applied to metrics)

Parameters

Parameters are used to assign values which are considered to be static within a metric model (e.g., the value of pi = 3.14159...). The user interface allows the definition of parameters as illustrated in the next figure:

image

Metrics

Metrics are the main constituents of the metric model, and their definition is at the heart of any monitoring functionality within NebulOuS. Each metric must be given a Name, which is used to refer to it and which is associated with sensor information. This information includes the framework used to obtain sensor data (either Prometheus or Netdata) and the name of the monitoring metric within that framework. Alternatively, it can contain only the name of the sensing technology (e.g., prometheus), in which case the name of the monitoring metric, along with any other details, is added to the Config keymap.

Using the Netdata sensing framework

To illustrate, let's assume that we would like to model CPU consumption in Kubernetes by using Netdata. In this scenario, the Name could be cpu_consumption and the Sensor field could be populated with the value netdata k8s.cgroup.cpu. Alternatively, the Sensor field could be populated only with the word netdata and the Config keymap could be configured to have the key scope_contexts with the value k8s.cgroup.cpu.

Additional keys that can be defined for Netdata are endpoint (the URL suffix used to get the data), dimension (the desired Netdata dimensions), after (only data with Linux epoch timestamps greater than this value are requested), group (the aggregation function, e.g., average) and format (the manner in which data are exported, e.g., ssv).

Using the Prometheus sensing framework

To illustrate, let's assume that we would like to model the response time of a component by using Prometheus. In this scenario, the Name could be response_time and the Sensor field could be populated with the value prometheus request_processing_seconds_sum. Alternatively, the Sensor field could be populated only with the word prometheus and the Config keymap could be configured to have the key metric with the value request_processing_seconds_sum.
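The two equivalent ways of filling in the Sensor field (framework plus metric name, or framework only with the metric name in the Config keymap) can be sketched as a small normalisation step. The function `normalize_sensor` and the `DEFAULT_KEY` mapping are hypothetical and written only for illustration; how NebulOuS processes these fields internally is not described here. The sketch also assumes that an explicit Config entry takes precedence over the short form.

```python
from typing import Dict, Optional, Tuple

# Config key that holds the metric name for each framework, as described in the text.
DEFAULT_KEY = {"netdata": "scope_contexts", "prometheus": "metric"}


def normalize_sensor(sensor: str,
                     config: Optional[Dict[str, str]] = None) -> Tuple[str, Dict[str, str]]:
    """Turn a Sensor value and Config keymap into (framework, full config).

    Hypothetical helper: "netdata k8s.cgroup.cpu" and
    ("netdata", {"scope_contexts": "k8s.cgroup.cpu"}) give the same result.
    """
    parts = sensor.split(maxsplit=1)
    if not parts:
        raise ValueError("Sensor field is empty")
    framework = parts[0].lower()
    if framework not in DEFAULT_KEY:
        raise ValueError(f"Unknown sensing framework: {framework}")
    merged = dict(config or {})
    if len(parts) == 2:
        # Assumption: an explicit Config entry wins over the short form.
        merged.setdefault(DEFAULT_KEY[framework], parts[1])
    return framework, merged


short_form = normalize_sensor("netdata k8s.cgroup.cpu")
long_form = normalize_sensor("netdata", {"scope_contexts": "k8s.cgroup.cpu"})
```

Both `short_form` and `long_form` evaluate to the same pair, which is the sense in which the two ways of completing the form are interchangeable.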

Additional keys that can be defined for Prometheus are endpoint (the URL suffix used to get the data, or /), delay (the number of seconds by which the collection of values is delayed), intervalPeriod (the integer period over which the metric will be collected) and intervalUnit (one of DAYS, HOURS, MINUTES, SECONDS, MILLISECONDS, MICROSECONDS, NANOSECONDS, MILLIS, MICROS, NANOS; upper or lower case is irrelevant).
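As a rough illustration of how intervalPeriod and intervalUnit combine, the sketch below converts the pair into seconds, accepting the unit names listed above in any letter case. The function name `interval_seconds` and the treatment of MILLIS, MICROS and NANOS as aliases of the longer names are assumptions made for this example.

```python
# Short unit names are assumed to be aliases of the long ones.
_ALIASES = {"MILLIS": "MILLISECONDS", "MICROS": "MICROSECONDS", "NANOS": "NANOSECONDS"}

_SECONDS_PER_UNIT = {
    "DAYS": 86400.0,
    "HOURS": 3600.0,
    "MINUTES": 60.0,
    "SECONDS": 1.0,
    "MILLISECONDS": 1e-3,
    "MICROSECONDS": 1e-6,
    "NANOSECONDS": 1e-9,
}


def interval_seconds(interval_period: int, interval_unit: str) -> float:
    """Convert an intervalPeriod/intervalUnit pair into seconds (hypothetical helper)."""
    canonical = interval_unit.strip().upper()      # case is irrelevant
    canonical = _ALIASES.get(canonical, canonical)
    if canonical not in _SECONDS_PER_UNIT:
        raise ValueError(f"Unsupported intervalUnit: {interval_unit}")
    return interval_period * _SECONDS_PER_UNIT[canonical]


every_two_minutes = interval_seconds(2, "minutes")
```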

Moreover, for all metrics the level of collection can be set to either 'Per component' or 'Global'.

image

Metrics can be either raw or composite. Raw metrics are monitoring metrics whose values are not the result of any aggregation by the NebulOuS EMS. Composite metrics, on the other hand, report an aggregation over one or more raw metrics. To define composite metrics, raw metrics need to be defined first.

The definition of raw metrics includes the Name and Sensor fields and the Config keymap, together with the output interval type (All, Single, Last), the duration of the interval at which those raw metrics are published, and the unit of the raw metrics.

The definition of composite metrics includes the Name and Sensor fields and the Config keymap, together with the output interval type (All, Sliding), the duration of the interval at which the metrics are published, and their unit. The same details can be specified for input intervals as well (the interval type options there are batch or sliding). If a composite metric depends on a raw metric that is collected per component, the composite metric must be collected per component too.

image
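The relationship between raw and composite metrics can be illustrated with a minimal sketch. It shows a composite value computed as a mean over a sliding window of raw samples, and a check for the rule that a composite metric built on a per-component raw metric must itself be per component. All names are hypothetical, and the window is counted in samples rather than in time, which is a simplification of the interval settings described above and not how the NebulOuS EMS is stated to work.

```python
from collections import deque
from typing import Deque, Iterable, List

PER_COMPONENT = "Per component"
GLOBAL = "Global"


def validate_composite_level(composite_level: str, raw_levels: List[str]) -> None:
    """Raise if a composite metric is Global while one of its raw inputs is per component."""
    if PER_COMPONENT in raw_levels and composite_level != PER_COMPONENT:
        raise ValueError("A composite metric over a per-component raw metric "
                         "must also be collected per component")


def sliding_mean(samples: Iterable[float], window_size: int) -> List[float]:
    """Mean of the last `window_size` raw samples, emitted once the window is full."""
    window: Deque[float] = deque(maxlen=window_size)
    results: List[float] = []
    for sample in samples:
        window.append(sample)  # the oldest sample is dropped automatically
        if len(window) == window_size:
            results.append(sum(window) / window_size)
    return results


raw_cpu_samples = [10, 20, 30, 40]             # illustrative raw metric values
mean_cpu = sliding_mean(raw_cpu_samples, 2)    # [15.0, 25.0, 35.0]
```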

SLOs

SLOs defined in NebulOuS follow a hierarchical structure of rules combined with AND/OR, where each rule compares a metric against a numerical value using a comparison operator. In the example below, we can observe a composite SLO rule stating that if (cpu_usage>=70%) or (requests_per_second>10 and ram_usage>=70%) then an SLO Violation is triggered. The metrics which can be used in SLOs are chosen from the metrics defined in the previous steps.

image
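The hierarchical rule in this example can be sketched as a small tree of constraints and AND/OR groups. The classes `Constraint` and `Group` are hypothetical and only illustrate the structure; they do not reflect the format in which NebulOuS stores or evaluates SLOs.

```python
import operator
from dataclasses import dataclass
from typing import Dict, Tuple, Union

_COMPARATORS = {">": operator.gt, ">=": operator.ge,
                "<": operator.lt, "<=": operator.le}


@dataclass(frozen=True)
class Constraint:
    """A leaf rule: metric, comparison operator, numerical threshold."""
    metric: str
    comparator: str
    threshold: float

    def holds(self, values: Dict[str, float]) -> bool:
        return _COMPARATORS[self.comparator](values[self.metric], self.threshold)


@dataclass(frozen=True)
class Group:
    """An inner node combining child rules with AND or OR."""
    logic: str
    children: Tuple[Union["Constraint", "Group"], ...]

    def holds(self, values: Dict[str, float]) -> bool:
        if self.logic not in ("AND", "OR"):
            raise ValueError(f"Unsupported logic: {self.logic}")
        results = [child.holds(values) for child in self.children]
        return all(results) if self.logic == "AND" else any(results)


# (cpu_usage >= 70) or (requests_per_second > 10 and ram_usage >= 70)
slo = Group("OR", (
    Constraint("cpu_usage", ">=", 70),
    Group("AND", (
        Constraint("requests_per_second", ">", 10),
        Constraint("ram_usage", ">=", 70),
    )),
))

violated = slo.holds({"cpu_usage": 50, "requests_per_second": 12, "ram_usage": 80})
```

Here `violated` is true: CPU usage is below 70, but the AND branch holds because both the request rate and the RAM usage exceed their thresholds.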
