3.1.3.2 Publishing system and cluster metrics using Netdata
This method of acquiring system and K8S metric values involves the deployment of one Netdata agent at every K8S cluster node. Netdata is open-source software that collects metrics, displays them as charts, and also exposes them through a REST API. The default Nebulous application deployment scenario installs Netdata agents along with EMS at application clusters. EPAs periodically contact the REST API server of each Netdata agent and scrape the required metrics. To enable EPAs to scrape the Netdata agents, the application metric model must provide the needed configuration. For each raw metric that will obtain its values using this method, it is necessary to define a sensor of “netdata” type and provide the corresponding configuration (including the scraping period).
NOTE:
At each Kubernetes cluster node, exactly one instance of EPA and one instance of the Netdata agent are deployed, as DaemonSets. At runtime, each EPA queries its co-located Netdata agent. The <NETDATA_IP_ADDRESS> is the node's IP address, provided by Kubernetes through the Downward API.
In order to define a raw metric that takes its values from Netdata agents, the netdata type must be entered in the Sensor field of the Nebulous GUI. This will instruct EPA to use its K8S Netdata collector plugin for retrieving the values. Under the hood, the K8S Netdata collector plugin will build a URL of the form:
http://<NETDATA_IP_ADDRESS>:<PORT>/<PATH>?<QUERY_PARAMS>&format=ssv
and attempt to retrieve the corresponding JSON response from there. Next, it will extract the value(s) of the metric of interest (see below) and publish them as the raw metric's value(s) in the EPA broker. If configured, it will also aggregate multiple values into a single one.
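The following is a minimal sketch of this procedure, assuming the default /api/v2/data endpoint and the view.dimensions response layout discussed later in this section; the names and the use of the requests library are illustrative and do not reflect the actual EPA plugin code.

```python
import requests

# Assumption: the node IP injected by Kubernetes through the Downward API.
NETDATA_IP_ADDRESS = "10.0.0.12"

def query_netdata(metric, port=19999, endpoint="/api/v2/data", extra_params=None):
    """Build the Netdata URL, fetch the JSON response and return {id: value}."""
    params = {
        "scope_contexts": metric,   # the metric of interest
        "dimension": "*",
        "after": -1,
        "time_group": "average",
        "format": "ssv",
    }
    params.update(extra_params or {})
    url = f"http://{NETDATA_IP_ADDRESS}:{port}{endpoint}"
    response = requests.get(url, params=params, timeout=10)
    response.raise_for_status()
    body = response.json()
    # For k8s.* metrics the ids identify the pods the values refer to.
    ids = body["view"]["dimensions"]["ids"]
    values = body["view"]["dimensions"]["values"]
    return dict(zip(ids, values))
```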
The following screenshot explains how a raw metric collecting values from Netdata can be set up in the Nebulous GUI (Metric model editor). If no configuration is provided (i.e. defaults apply), the metric of interest can be given in the Sensor field, right after the sensor type netdata.
In order to build the URL, the collector plugin will use the provided configuration settings, or the corresponding defaults.
If the metric of interest is a Kubernetes-related metric (its name starts with k8s.), the collector plugin can take the pod name and namespace into consideration. The metric of interest must be provided in the configuration.
The configuration comprises a few settings used to guide the collector plugin, while the remaining settings are used to build the QUERY_PARAMS list of the URL.
The plugin-specific configuration settings, along with their respective defaults, are:
Plugin Setting | Type | Default value | Comments |
---|---|---|---|
endpoint | String | /api/v2/data | The <PATH> part of the URL. Only the v2 version has been tested. |
port | Port | 19999 | The <PORT> part of the URL. Allowed values: 1..65535. |
components | String | component name | In case of a K8S metric of interest, specifies which pod(s) to pick. If left empty it will pick all pods in the namespace. If omitted it will use the name of the component(s) the raw metric applies to. |
namespace | String | default | In case of a K8S metric of interest, specifies the pod namespace to use. If left empty it will pick matching pods from all namespaces. If omitted it defaults to the default namespace. |
results-aggregation | Enum | no default | Allowed values: SUM, AVERAGE, COUNT, MIN, MAX, NONE. If omitted or NONE, individual events will be published for each metric value. |
intervalPeriod | Positive Integer | 60 | How often to query the Netdata API. |
intervalUnit | Enum | SECONDS | The time unit of intervalPeriod. Allowed values: SECONDS, MINUTES, HOURS, DAYS. |
If intervalPeriod is omitted, the querying period is taken from the raw metric Output interval and unit fields. If these are not specified either, it is assumed to be 60 seconds.
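As an illustration, a raw metric that only specifies the metric of interest effectively runs with the defaults of the table above; a minimal sketch of the resolved plugin settings (illustrative only, not the plugin's actual internal representation):

```python
# Effective plugin settings when only the metric of interest is provided
# (illustrative only; mirrors the defaults listed in the table above).
plugin_settings = {
    "endpoint": "/api/v2/data",   # <PATH> part of the URL
    "port": 19999,                # <PORT> part of the URL
    # "components" omitted -> the component(s) the raw metric applies to
    # "namespace" omitted  -> the namespace named "default"
    # "results-aggregation" omitted -> behaves like NONE (one event per value)
    "intervalPeriod": 60,         # or the raw metric Output interval, if given
    "intervalUnit": "SECONDS",
}
```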
The settings used to build the Netdata URL, along with their respective defaults, are:
Netdata Setting | Type | Default value | Comments |
---|---|---|---|
scope_contexts | String | no default | REQUIRED: The metric(s) of interest to extract. Can be a comma-separated list. |
context | String | no default | Can be used instead of scope_contexts. Check the Netdata documentation for details. |
dimension | String | * | The scope_contexts dimensions to use. |
after | Long | -1 | Selects the measurements taken during the last second (before now). |
time_group | Enum | average | Defines the method of grouping multiple measurements. |
The settings in the table above are always added to the URL (either with the provided value or with the default). Any additional settings provided (not listed in the tables above) will also be included in the query parameters list.
For a complete list of the supported query parameters, and their semantics, please consult the official Netdata documentation on the topic. You can also check Netdata API.
Netdata metrics pertaining to Kubernetes are named using the k8s. prefix, for instance k8s.cgroup.cpu.
When a metric can be measured per pod, every Netdata response will contain one measurement (value) for each pod (e.g. the CPU consumed by each pod). The K8S Netdata collector plugin will filter these values in order to retain only those complying with the raw metric specification.
NOTE:
The K8S Netdata collector plugin will attempt to extract metric values from the view.dimensions.ids and view.dimensions.values sections of the JSON response. If the metric of interest is a K8S metric, the ids represent different pods (running in various namespaces). The plugin will filter the ids (i.e. pods) based on the provided component name(s) and namespace.
IMPORTANT:
Pod names may differ from the component names (as they appear in the metric model). For instance, using Helm will prepend the deployment name in front of each component name when generating pod names. In this case it is essential to set the components setting to a value that includes the Helm deployment prefix (or reflects any other naming deviation).
The following table details the outcome of each possible combination of the components and namespace configuration settings.
For each pod selected, the corresponding measurement (metric value) will be kept for further processing. The rest will be filtered out.
components | namespace | Pods selected |
---|---|---|
provided | provided | Pods whose name is included in the components list, running in the namespace specified in the namespace setting |
blank | provided | All pods running in the namespace specified in the namespace setting |
omitted | provided | Pods named exactly as one of the metric model components the raw metric applies to, running in the namespace specified in the namespace setting |
provided | blank | Pods whose name is included in the components list, running in any namespace |
blank | blank | All pods, running in any namespace |
omitted | blank | Pods named exactly as one of the metric model components the raw metric applies to, running in any namespace |
provided | omitted | Pods whose name is included in the components list, running in the namespace named default |
blank | omitted | All pods, running in the namespace named default |
omitted | omitted | Pods named exactly as one of the metric model components the raw metric applies to, running in the namespace named default |
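The following is a minimal sketch of the selection logic summarised in the table above, assuming every Netdata dimension id can be resolved to a (namespace, pod name) pair. Prefix matching is used here so that generated pod-name suffixes (e.g. testm1-server-6789487c9b-76pdz) still match a component name; the actual matching rule of the plugin may differ.

```python
# Sketch of the components/namespace selection semantics described in the table above.
# pods: iterable of (namespace, pod_name) pairs resolved from the Netdata dimension ids.
# components / namespace: None means 'omitted', '' means 'blank' (match everything).
def select_pods(pods, components=None, namespace=None,
                model_components=(), default_namespace="default"):
    if components is None:            # omitted -> components the raw metric applies to
        wanted_names = set(model_components)
    elif components == "":            # blank -> any pod name
        wanted_names = None
    else:                             # provided -> comma-separated list of names
        wanted_names = {c.strip() for c in components.split(",")}

    if namespace is None:             # omitted -> the namespace named "default"
        wanted_namespace = default_namespace
    elif namespace == "":             # blank -> any namespace
        wanted_namespace = None
    else:                             # provided -> the given namespace
        wanted_namespace = namespace

    selected = []
    for ns, pod_name in pods:
        if wanted_namespace is not None and ns != wanted_namespace:
            continue
        # Prefix match is an assumption; it accounts for generated pod-name suffixes.
        if wanted_names is not None and not any(
                pod_name.startswith(name) for name in wanted_names):
            continue
        selected.append((ns, pod_name))
    return selected
```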
The values retained (each pertaining to one pod) can either:
- be aggregated according to the results-aggregation value and then published as the raw metric value (one event), or
- be published immediately as the raw metric values (one or more events).
In the latter case, each event conveying a raw metric value will also have the destination-key property set with the respective pod name and namespace.
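The following is a minimal sketch of the results-aggregation behaviour described above (illustrative only; publishing the resulting event(s) to the EPA broker is not shown):

```python
# Reduce the retained metric values according to the results-aggregation setting.
def aggregate(values, results_aggregation=None):
    """values: metric values retained after pod filtering."""
    if not values or results_aggregation in (None, "NONE"):
        return values                       # one event per value (destination-key set per pod)
    if results_aggregation == "SUM":
        return [sum(values)]
    if results_aggregation == "AVERAGE":
        return [sum(values) / len(values)]
    if results_aggregation == "COUNT":
        return [len(values)]
    if results_aggregation == "MIN":
        return [min(values)]
    if results_aggregation == "MAX":
        return [max(values)]
    raise ValueError(f"Unsupported results-aggregation: {results_aggregation}")
```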
The following examples demonstrate how to configure the metric collection from Netdata, for a few typical use cases.
Node metrics refer to the system (VM or computer) hosting the Kubernetes node.
The following screenshot of the Nebulous GUI gives the definition of the current_cpu raw metric, which collects the measurements of the system.cpu Netdata metric. This metric encompasses several dimensions (e.g. system, user, iowait, etc.), each with an individual measurement, hence resulting in multiple metric values.
Based on the raw metric definition, the K8S Netdata collector plugin will query the Netdata agent using the following settings.
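Assuming the defaults described earlier, the resulting request would look similar to the following (illustrative):
http://<NETDATA_IP_ADDRESS>:19999/api/v2/data?scope_contexts=system.cpu&dimension=*&after=-1&time_group=average&format=ssv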
Since no results aggregation is set in the raw metric definition, each dimension measurement will be published as a separate event in the EPA event broker. The value of the destination-key property of each event gives the name of the respective dimension.
This is a variation of the previous example where results aggregation is used.
In order to obtain the total system CPU consumption, we must redefine the current_cpu raw metric to sum up the individual dimension measurements into a single value.
The REST call to the Netdata agent will use the following settings.
In this case the output will be a single event conveying the total (sum) CPU, and the destination-key property is not set.
Kubernetes metrics refer to the pods running in a Kubernetes cluster, or to the cluster itself. Note that Kubernetes cluster metrics are different from the (corresponding) host metrics.
The following screenshot of the Nebulous GUI gives the definition of the current_cpu raw metric, which collects the measurements of the k8s.cgroup.cpu Netdata metric and retains only those pertaining to the pods corresponding to the component the raw metric applies to, i.e. testm1-server. This metric encompasses a different measurement for each testm1-server pod, hence resulting in multiple metric values.
NOTE: Each Netdata agent reports measurements only for pods running on the same node. Therefore, pods of the same component type running on different nodes will be reported by different Netdata agents and queried by different EPAs.
The REST call to the Netdata agent will use the following settings.
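Assuming the defaults described earlier, the resulting request would look similar to the following (illustrative); the component (testm1-server) and namespace (default) filtering is applied by the collector plugin on the returned dimension ids:
http://<NETDATA_IP_ADDRESS>:19999/api/v2/data?scope_contexts=k8s.cgroup.cpu&dimension=*&after=-1&time_group=average&format=ssv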
The next screenshot gives a list of all running pods in the cluster. There is only one pod matching the testm1-server component and the default namespace.
Since no results aggregation is set in the raw metric definition, each pod measurement will be published as a separate event in the EPA event broker. The value of the destination-key property of each event gives the namespace and name of the respective pod. In this example it is destination-key=default,testm1-server-6789487c9b-76pdz.
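A consumer of such events could split the destination-key value into its two parts; a minimal sketch assuming the comma-separated format shown above:

```python
# Split a destination-key of the form "<namespace>,<pod-name>" (as in the example above).
destination_key = "default,testm1-server-6789487c9b-76pdz"
namespace, pod_name = destination_key.split(",", 1)
```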
This is a variation of the previous example, where more pods are involved. The raw metric of the previous example would report an individual event for each pod (i.e. the CPU consumed by each pod). In this case we will define a raw metric for collecting the total (sum) CPU consumed by all testm1-server pods.
The following screenshot of the Nebulous GUI gives the definition of the current_cpu raw metric, which collects the measurements of the k8s.cgroup.cpu Netdata metric and retains those pertaining to the testm1-server pods (which the raw metric applies to).
Since results-aggregation is set to SUM, there will be a single metric value conveying the sum of the individual pod CPU measurements.
The REST call to the Netdata agent will use the following settings.
The next screenshot gives a list of all running pods in the cluster. There are two pods per worker node matching the testm1-server component and the default namespace.
Since results aggregation is set to SUM in the raw metric definition, all pod measurements will be aggregated per node, and a single value will be published in the EPA event broker. In this case the destination-key property will be omitted.
This is a variation of the previous examples, where all pods in a cluster node are involved. In this case we will define a raw metric for summing CPU consumed by all node pods.
The following screenshot of the Nebulous GUI gives the definition of the current_cpu raw metric, which collects the measurements of the k8s.cgroup.cpu Netdata metric and includes all pods.
Again, results-aggregation is set to SUM, therefore a single metric value event is published, conveying the K8S cluster node CPU.
The REST call to the Netdata agent will use the following settings.
Since results aggregation is set to SUM in the raw metric definition, all pod measurements will be aggregated per node, and a single value will be published in the EPA event broker. In this case the destination-key property will be omitted.
NOTE:
The EPA on each Kubernetes node reports that node's total CPU, not a cluster-wide value. An overall cluster value (covering all nodes) would require a composite metric that aggregates the raw metric values of all EPAs.
Sample list of Netdata metrics. The available metrics may vary between devices depending on the hardware, architecture, OS, and installed software.
app.cpu_context_switches | ipv4.sockstat_udp_mem | k8s_kubelet.kubelet_pleg_relist_interval_microseconds | system.clock_status |
app.cpu_utilization | ipv4.sockstat_udp_sockets | k8s_kubelet.kubelet_pleg_relist_latency_microseconds | system.clock_sync_offset |
app.disk_logical_io | ipv4.sockstat_udplite_sockets | k8s_kubelet.kubelet_pods_log_filesystem_used_bytes | system.clock_sync_state |
app.disk_physical_io | ipv4.udperrors | k8s_kubelet.kubelet_pods_running | system.cpu |
app.fds_open | ipv4.udplite | k8s_kubelet.kubelet_runtime_operations | system.ctxt |
app.fds_open_limit | ipv4.udplite_errors | k8s_kubelet.kubelet_token_requests | system.entropy |
app.mem_page_faults | ipv4.udppackets | k8s_kubelet.rest_client_requests_by_code | system.file_nr_used |
app.mem_private_usage | ipv6.bcast | k8s_kubelet.rest_client_requests_by_method | system.file_nr_utilization |
app.mem_usage | ipv6.ect | k8s_kubelet.volume_manager_total_volumes | system.forks |
app.processes | ipv6.errors | k8s_kubeproxy.http_request_duration | system.idlejitter |
app.swap_usage | ipv6.fragsin | k8s_kubeproxy.kubeproxy_sync_proxy_rules | system.interrupts |
app.threads | ipv6.fragsout | k8s_kubeproxy.kubeproxy_sync_proxy_rules_latency | system.intr |
app.uptime | ipv6.groupmemb | k8s_kubeproxy.kubeproxy_sync_proxy_rules_latency_microseconds | system.io |
app.vmem_usage | ipv6.icmp | k8s_kubeproxy.rest_client_requests_by_code | system.ip |
disk.avgsz | ipv6.icmpechos | k8s_kubeproxy.rest_client_requests_by_method | system.ipc_semaphore_arrays |
disk.await | ipv6.icmperrors | mem.available | system.ipc_semaphores |
disk.backlog | ipv6.icmpmldv2 | mem.balloon | system.ipv6 |
disk.busy | ipv6.icmpneighbor | mem.cma | system.load |
disk.inodes | ipv6.icmpredir | mem.committed | system.net |
disk.io | ipv6.icmprouter | mem.directmaps | system.pgpgio |
disk.iotime | ipv6.icmptypes | mem.fragmentation_index_dma | system.processes |
disk.mops | ipv6.mcast | mem.fragmentation_index_dma32 | system.processes_state |
disk.ops | ipv6.mcastpkts | mem.fragmentation_index_normal | system.ram |
disk.qops | ipv6.packets | mem.kernel | system.shared_memory_bytes |
disk.space | ipv6.sockstat6_frag_sockets | mem.ksm_cow | system.shared_memory_segments |
disk.svctm | ipv6.sockstat6_raw_sockets | mem.oom_kill | system.softirqs |
disk.util | ipv6.sockstat6_tcp_sockets | mem.pgfaults | system.softnet_stat |
disk_ext.avgsz | ipv6.sockstat6_udp_sockets | mem.reclaiming | system.uptime |
disk_ext.await | ipv6.sockstat6_udplite_sockets | mem.slab | systemd.service.memory.failcnt |
disk_ext.io | ipv6.udperrors | mem.swap | systemd.service.memory.paging.faults |
disk_ext.iotime | ipv6.udpliteerrors | mem.swap_cached | systemd.service.memory.paging.io |
disk_ext.mops | ipv6.udplitepackets | mem.swapio | systemd.service.memory.ram.usage |
disk_ext.ops | ipv6.udppackets | mem.thp | systemd.service.memory.usage |
ip.sockstat_sockets | ipvs.net | mem.thp_collapse | systemd.service.memory.writeback |
ip.tcp_accept_queue | ipvs.packets | mem.thp_compact | systemd.service.pids.current |
ip.tcp_syn_queue | ipvs.sockets | mem.thp_details | user.cpu_context_switches |
ip.tcpconnaborts | k8s.cgroup.cpu | mem.thp_faults | user.cpu_utilization |
ip.tcperrors | k8s.cgroup.cpu_limit | mem.thp_file | user.disk_logical_io |
ip.tcphandshake | k8s.cgroup.mem | mem.thp_split | user.disk_physical_io |
ip.tcpmemorypressures | k8s.cgroup.mem_activity | mem.thp_swapout | user.fds_open |
ip.tcpofo | k8s.cgroup.mem_failcnt | mem.thp_zero | user.fds_open_limit |
ip.tcpopens | k8s.cgroup.mem_usage | mem.writeback | user.mem_page_faults |
ip.tcppackets | k8s.cgroup.mem_usage_limit | mem.zswapio | user.mem_private_usage |
ip.tcpreorders | k8s.cgroup.mem_utilization | net.carrier | user.mem_usage |
ip.tcpsock | k8s.cgroup.net_carrier | net.drops | user.processes |
ip.tcpsyncookies | k8s.cgroup.net_drops | net.errors | user.swap_usage |
ipv4.bcast | k8s.cgroup.net_errors | net.events | user.threads |
ipv4.bcastpkts | k8s.cgroup.net_events | net.fifo | user.uptime |
ipv4.ecnpkts | k8s.cgroup.net_fifo | net.mtu | user.vmem_usage |
ipv4.errors | k8s.cgroup.net_mtu | net.net | usergroup.cpu_context_switches |
ipv4.fragsin | k8s.cgroup.net_net | net.operstate | usergroup.cpu_utilization |
ipv4.fragsout | k8s.cgroup.net_operstate | net.packets | usergroup.disk_logical_io |
ipv4.icmp | k8s.cgroup.net_packets | netfilter.conntrack_sockets | usergroup.disk_physical_io |
ipv4.icmp_errors | k8s.cgroup.pgfaults | netfilter.synproxy_conn_reopened | usergroup.fds_open |
ipv4.icmpmsg | k8s.cgroup.pids_current | netfilter.synproxy_cookies | usergroup.fds_open_limit |
ipv4.mcast | k8s.cgroup.writeback | netfilter.synproxy_syn_received | usergroup.mem_page_faults |
ipv4.mcastpkts | k8s_kubelet.apiserver_audit_requests_rejected | sctp.chunks | usergroup.mem_private_usage |
ipv4.packets | k8s_kubelet.apiserver_storage_data_key_generation_failures | sctp.established | usergroup.mem_usage |
ipv4.sockstat_frag_mem | k8s_kubelet.apiserver_storage_data_key_generation_latencies | sctp.fragmentation | usergroup.processes |
ipv4.sockstat_frag_sockets | k8s_kubelet.apiserver_storage_data_key_generation_latencies_percent | sctp.packet_errors | usergroup.swap_usage |
ipv4.sockstat_raw_sockets | k8s_kubelet.apiserver_storage_envelope_transformation_cache_misses | sctp.packets | usergroup.threads |
ipv4.sockstat_tcp_mem | k8s_kubelet.kubelet_containers_running | sctp.transitions | usergroup.uptime |
ipv4.sockstat_tcp_sockets | k8s_kubelet.kubelet_node_config_error | system.active_processes | usergroup.vmem_usage |
[1] Netdata site
[2] Netdata Queries/Lookup
[3] Netdata API
Funded by the European Union. Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or the Directorate-General for Communications Networks, Content and Technology. Neither the European Union nor the granting authority can be held responsible for them.
© 2024 NEBULOUS. ALL RIGHTS RESERVED