
3.1.3.2 Publishing system and cluster metrics using Netdata


This method of acquiring system and K8S metric values involves the deployment of one Netdata agent at every K8S cluster node. Netdata is open source software for collecting metrics, displaying them as charts, and also providing them through a REST API. The default Nebulous application deployment scenario installs Netdata agents along with EMS at application clusters. EPAs periodically contact the REST API server of each Netdata agent and scrape the required metrics. To enable EPAs to scrape the Netdata agents, the application metric model must provide the needed configuration. For each raw metric whose values will be collected using this method, it is necessary to define a sensor of “netdata” type and provide the corresponding configuration (including the scraping period).

NOTE:
At each Kubernetes cluster node, exactly one instance of EPA and one instance of the Netdata agent are deployed, as DaemonSets. At runtime, each EPA queries its collocated Netdata agent. The <NETDATA_IP_ADDRESS> is the node's IP address and is provided by Kubernetes through the Downward API.

In order to define a raw metric that takes its values from Netdata agents, the netdata type must be entered in the Sensor field of the Nebulous GUI. This instructs EPA to use its K8S Netdata collector plugin for retrieving the values. Under the hood, the K8S Netdata collector plugin builds a URL of the form:
http://<NETDATA_IP_ADDRESS>:<PORT>/<PATH>?<QUERY_PARAMS>&format=ssv
and attempts to retrieve the relevant JSON response from there. It then extracts the value(s) of the metric of interest (see next) and publishes them as the raw metric's value(s) in the EPA broker. If configured, it will also aggregate multiple values into a single one.
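
For illustration, with the defaults described in the Configuration section below and a hypothetical metric of interest of system.cpu, the built URL would look similar to:
http://<NETDATA_IP_ADDRESS>:19999/api/v2/data?scope_contexts=system.cpu&dimension=*&after=-1&time_group=average&format=ssv
(The exact query parameters depend on the sensor configuration, as explained next.)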

The following screenshot explains how a raw metric collecting values from Netdata can be set up in the Nebulous GUI (Metric model editor).

netdata-config-for-node-1-with-comments

If no configuration is provided (i.e. defaults apply), the metric of interest can be provided in the Sensor field, after the sensor type netdata.

netdata-simple-with-comments
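
For instance, a Sensor field entry along the lines of the following (the metric name is illustrative) would collect system.cpu using all default settings:
netdata system.cpu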

Configuration

In order to build the URL, the collector plugin will use the provided configuration settings, or the corresponding defaults. If the metric of interest is a Kubernetes-related metric (its name starts with k8s.), the collector plugin can take the pod name and namespace into consideration. The metric of interest must be provided in the configuration.

The configuration comprises a few settings used to guide the collector plugin, while the remaining ones are used to build the QUERY_PARAMS part of the URL.

The plugin-specific configuration settings, along with their respective defaults, are:

| Plugin Setting | Type | Default value | Comments |
|---|---|---|---|
| endpoint | String | /api/v2/data | The <PATH> part of the URL. Only the v2 version has been tested. |
| port | Port | 19999 | The <PORT> part of the URL. Allowed values: 1..65535. |
| components | String | component name | In case of a K8S metric of interest, specifies which pod(s) to pick. If left empty it will pick all pods in the namespace. If omitted it will use the name of the component(s) the raw metric applies to. |
| namespace | String | default | In case of a K8S metric of interest, specifies the pod namespace to use. If left empty it will pick matching pods from all namespaces. If omitted it defaults to the default namespace. |
| results-aggregation | Enum | no default | Allowed values: SUM, AVERAGE, COUNT, MIN, MAX, NONE. If omitted or set to NONE, individual events will be published for each metric value. |
| intervalPeriod | Positive Integer | 60 | How often to query the Netdata API. |
| intervalUnit | Enum | SECONDS | The time unit of intervalPeriod. Allowed values: SECONDS, MINUTES, HOURS, DAYS. |

If intervalPeriod is omitted, the querying period is taken from the raw metric's Output interval and unit fields. If these are not specified either, it is assumed to be 60 seconds.

The settings used to build the Netdata URL, along with their respective defaults, are:

| Netdata Setting | Type | Default value | Comments |
|---|---|---|---|
| scope_contexts | String | no default | REQUIRED: The metric(s) of interest to extract. Can be a comma-separated list. |
| context | String | no default | Can be used instead of scope_contexts. Check the Netdata documentation for details. |
| dimension | String | * | The scope_contexts dimensions to use. |
| after | Long | -1 | Selects the measurements taken in the last second (before now). |
| time_group | Enum | average | Defines the method of grouping multiple measurements. |

The settings in the table above are always added to the URL (either with the provided value or the default). Any additional settings provided (not listed in the tables above) will also be included in the query parameters list.

For a complete list of the supported query parameters and their semantics, please consult the official Netdata documentation on the topic. You can also check the Netdata API.
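
To make the URL construction concrete, the following is a minimal sketch (in Python, for illustration only; it is not the actual collector plugin, and all function and variable names are hypothetical) of how the settings above could be combined into a request URL, and how the querying period could be resolved.

```python
from urllib.parse import urlencode

# Settings consumed by the plugin itself; everything else becomes a query parameter.
PLUGIN_SETTINGS = {"endpoint", "port", "components", "namespace",
                   "results-aggregation", "intervalPeriod", "intervalUnit"}

# Defaults for the Netdata query parameters (see the table above).
NETDATA_DEFAULTS = {"dimension": "*", "after": "-1", "time_group": "average"}

def build_netdata_url(node_ip, config):
    """Build the Netdata data API URL from a 'netdata' sensor configuration."""
    endpoint = config.get("endpoint", "/api/v2/data")   # the <PATH> part
    port = int(config.get("port", 19999))                # the <PORT> part

    # Start from the defaults, then overlay any non-plugin setting from the
    # configuration (e.g. scope_contexts, context, or extra Netdata parameters).
    params = dict(NETDATA_DEFAULTS)
    params.update({k: v for k, v in config.items() if k not in PLUGIN_SETTINGS})
    params["format"] = "ssv"                              # always appended

    return f"http://{node_ip}:{port}{endpoint}?{urlencode(params)}"

def resolve_period_seconds(config, output_interval_seconds=None):
    """How often to query Netdata: intervalPeriod/intervalUnit, otherwise the
    raw metric's Output interval, otherwise 60 seconds."""
    unit = {"SECONDS": 1, "MINUTES": 60, "HOURS": 3600, "DAYS": 86400}
    if "intervalPeriod" in config:
        return int(config["intervalPeriod"]) * unit[config.get("intervalUnit", "SECONDS")]
    return output_interval_seconds or 60

# Example: a raw metric asking for k8s.cgroup.cpu with all defaults.
print(build_netdata_url("10.0.1.5", {"scope_contexts": "k8s.cgroup.cpu"}))
```

Note that the components, namespace and results-aggregation settings never appear in the URL; they only control how the plugin filters and aggregates the values found in the response, as described in the following sections.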


Kubernetes metrics

Netdata metrics pertaining to Kubernetes are named using the k8s. prefix, for instance k8s.cgroup.cpu. When a metric can be measured per pod, there will be one measurement (value) for each pod in every Netdata response (e.g. the CPU consumed by each pod). The K8S Netdata collector plugin will filter these values in order to retain only those complying with the raw metric specification.

NOTE:
The K8S Netdata collector plugin will attempt to extract metric values from the JSON response sections under view.dimensions.ids and view.dimensions.values. If the metric of interest is a K8S metric, the ids represent different pods (running in various namespaces). The plugin will filter the ids (i.e. pods) based on the provided component name(s) and namespace.

IMPORTANT:
Pod names may differ from component names (as they appear in the metric model). For instance, Helm will prepend the deployment name to each component name when generating pod names. In this case it is essential to set the components setting to a value that includes the Helm deployment prefix (or accounts for any other naming deviation).

The following table details the outcome of each possible combination of the components and namespace configuration settings. For each pod selected, the corresponding measurement (metric value) will be kept for further processing; the rest will be filtered out. (A sketch of this selection logic is given after the table.)

| components | namespace | Pods selected |
|---|---|---|
| provided | provided | Pods with a name included in the components list, running in the namespace specified in the namespace setting |
| blank | provided | All pods running in the namespace specified in the namespace setting |
| omitted | provided | Pods named exactly as one of the metric model components the raw metric applies to, running in the namespace specified in the namespace setting |
| provided | blank | Pods with a name included in the components list, running in any namespace |
| blank | blank | All pods, running in any namespace |
| omitted | blank | Pods named exactly as one of the metric model components the raw metric applies to, running in any namespace |
| provided | omitted | Pods with a name included in the components list, running in the namespace named default |
| blank | omitted | All pods running in the namespace named default |
| omitted | omitted | Pods named exactly as one of the metric model components the raw metric applies to, running in the namespace named default |
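
The selection rules above can be summarised with the following illustrative sketch (Python, hypothetical names; the real plugin operates on the ids found in the Netdata response, and its exact name-matching rules may differ, e.g. for pod names that carry replica-set suffixes).

```python
OMITTED = None   # setting not present in the configuration
BLANK = ""       # setting present but left empty

def select_pods(pods, components, namespace, model_components):
    """pods: iterable of (namespace, pod_name) pairs reported by Netdata.
    model_components: the component(s) the raw metric applies to."""
    if components is OMITTED:
        wanted = list(model_components)
    elif components == BLANK:
        wanted = None                          # no name filtering
    else:
        wanted = [c.strip() for c in components.split(",")]

    ns = "default" if namespace is OMITTED else namespace   # BLANK -> any namespace

    selected = []
    for pod_ns, pod_name in pods:
        if ns and pod_ns != ns:
            continue
        # Prefix matching is an assumption for illustration; see the
        # IMPORTANT note above about Helm-generated pod names.
        if wanted is not None and not any(pod_name.startswith(w) for w in wanted):
            continue
        selected.append((pod_ns, pod_name))
    return selected

# Example based on the 'omitted / omitted' row of the table:
pods = [("default", "testm1-server-6789487c9b-76pdz"), ("kube-system", "coredns-0")]
print(select_pods(pods, OMITTED, OMITTED, ["testm1-server"]))
# -> [('default', 'testm1-server-6789487c9b-76pdz')]
```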

The values retained (each pertaining to one pod) can either:

  • be aggregated according to the results-aggregation value and then published as the raw metric value (one event), or
  • be immediately published as the raw metric values (one or more events).

In the latter case, each event conveying a raw metric value will also have the destination-key property set to the respective pod name and namespace.
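
The two publication paths can be illustrated as follows (a Python sketch with hypothetical event structures and illustrative measurement values; the events published in the EPA broker are simplified here).

```python
AGGREGATORS = {
    "SUM": sum,
    "AVERAGE": lambda v: sum(v) / len(v),
    "COUNT": len,
    "MIN": min,
    "MAX": max,
}

def to_events(values, results_aggregation=None):
    """values: {(namespace, pod_name): measurement} retained after filtering."""
    if results_aggregation and results_aggregation != "NONE":
        # One event carrying the aggregated value; destination-key is not set.
        return [{"value": AGGREGATORS[results_aggregation](list(values.values()))}]
    # One event per retained pod, tagged with '<namespace>,<pod name>'.
    return [{"value": v, "destination-key": f"{ns},{pod}"}
            for (ns, pod), v in values.items()]

measurements = {("default", "testm1-server-6789487c9b-76pdz"): 1.8,
                ("default", "testm1-server-6789487c9b-abcde"): 2.2}
print(to_events(measurements))              # two events, each with destination-key
print(to_events(measurements, "SUM"))       # [{'value': 4.0}]
```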


Examples

The following examples demonstrate how to configure metric collection from Netdata for a few typical use cases.

Node metric

Node metrics refer to the system (VM or computer) hosting the Kubernetes node.

The following screenshot of the Nebulous GUI gives the definition of the current_cpu raw metric, which collects the measurements of the system.cpu Netdata metric. This metric encompasses several dimensions (e.g. system, user, iowait, etc.), each with an individual measurement, hence resulting in multiple metric values.

netdata-example1--metric-model

Based on the raw metric definition, the K8S Netdata collector plugin will query the Netdata agent using the following settings.

netdata-example1--rest-call

Since no results aggregation is set in the raw metric definition, each dimension measurement will be published as a separate event in the EPA event broker. The destination-key property of each event gives the name of the respective dimension.

netdata-example1--events

Node metric with aggregation

This is a variation of the previous example where results aggregation is used. In order to obtain the total system CPU consumption, we must redefine the current_cpu raw metric to sum up the individual dimension measurements into a single value.

netdata-example2--metric-model

The REST call to the Netdata agent will use the following settings.

netdata-example2--rest-call

In this case the output will be a single event conveying the total (sum) CPU, and the destination-key property is not set.

netdata-example2--events

Single pod metric (K8S)

Kubernetes metrics refer to the pods running in a Kubernetes cluster, or to the cluster itself. Note that Kubernetes cluster metrics are different from the (corresponding) host metrics.

The following screenshot of the Nebulous GUI gives the definition of the current_cpu raw metric, which collects the measurements of the k8s.cgroup.cpu Netdata metric and retains only those pertaining to the pods of the component the raw metric applies to, i.e. testm1-server. This metric encompasses a different measurement for each testm1-server pod, hence resulting in multiple metric values.

netdata-example3--metric-model

NOTE: Each Netdata agent will report measurements only for pods running on the same node. Therefore, pods of the same component type running on different nodes will be reported by different Netdata agents, queried by different EPAs.

The REST call to the Netdata agent will use the following settings.

netdata-example3--rest-call
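
For reference, with the default settings the request of this example would resemble:
http://<NETDATA_IP_ADDRESS>:19999/api/v2/data?scope_contexts=k8s.cgroup.cpu&dimension=*&after=-1&time_group=average&format=ssv
The components and namespace settings do not appear in the URL; the plugin applies them to the view.dimensions entries of the response, as described above.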

The next screenshot gives a list of all running pods in the cluster. There is only one pod matching the testm1-server component and default namespace.

netdata-example3--pod-list

Since no results aggregation is set in the raw metric definition, each pod measurement will be published as a separate event in the EPA event broker. The destination-key property of each event gives the namespace and name of the respective pod. In this example it is destination-key=default,testm1-server-6789487c9b-76pdz.

netdata-example3--events

Multiple pod metrics with aggregation (K8S)

This is a variation of the previous example, where more pods are involved. The raw metric of the previous example would report an individual event for each pod (i.e. the CPU consumed by each pod). In this case we will define a raw metric for collecting the total (sum) CPU consumed by all testm1-server pods.

The following screenshot of the Nebulous GUI gives the definition of the current_cpu raw metric, which collects the measurements of the k8s.cgroup.cpu Netdata metric and retains those pertaining to the testm1-server pods (to which the raw metric applies). Since results-aggregation is set to SUM, there will be a single metric value conveying the sum of the individual pod CPU measurements.

netdata-example4--metric-model

The REST call to the Netdata agent will use the following settings.

netdata-example4--rest-call

The next screenshot gives a list of all running pods in the cluster. There are two pods matching the testm1-server component and default namespace per worker node.

netdata-example4--pod-list

Since results aggregation is set to SUM in the raw metric definition, all pod measurements will be aggregated per node, and a single value will be published in the EPA event broker. In this case the destination-key property is omitted.

netdata-example4--events

Node-wide metric with aggregation (K8S)

This is a variation of the previous examples, where all pods in a cluster node are involved. In this case we will define a raw metric for summing the CPU consumed by all of the node's pods.

The following screenshot of the Nebulous GUI gives the definition of the current_cpu raw metric, which collects the measurements of the k8s.cgroup.cpu Netdata metric and includes all pods. Again, results-aggregation is set to SUM, therefore a single metric value event is published, conveying the K8S cluster node's CPU.

netdata-example5--metric-model

The REST call to the Netdata agent will use the following settings.

netdata-example5--rest-call

Since results aggregation is set to SUM in the raw metric definition, all pod measurements will be aggregated per node, and a single value will be published in the EPA event broker. In this case the destination-key property is omitted.

netdata-example5--events

NOTE:
Each EPA at each Kubernetes node will report that node's total CPU, not a cluster-wide value. An overall cluster value (including all nodes) would require a composite metric that aggregates the EPA raw metric values.


Appendix

Sample list of Netdata metrics. The available metrics may vary between devices depending on the hardware, architecture, OS, and installed software.

Scroll horizontally to view all columns.

app.cpu_context_switches ipv4.sockstat_udp_mem k8s_kubelet.kubelet_pleg_relist_interval_microseconds system.clock_status
app.cpu_utilization ipv4.sockstat_udp_sockets k8s_kubelet.kubelet_pleg_relist_latency_microseconds system.clock_sync_offset
app.disk_logical_io ipv4.sockstat_udplite_sockets k8s_kubelet.kubelet_pods_log_filesystem_used_bytes system.clock_sync_state
app.disk_physical_io ipv4.udperrors k8s_kubelet.kubelet_pods_running system.cpu
app.fds_open ipv4.udplite k8s_kubelet.kubelet_runtime_operations system.ctxt
app.fds_open_limit ipv4.udplite_errors k8s_kubelet.kubelet_token_requests system.entropy
app.mem_page_faults ipv4.udppackets k8s_kubelet.rest_client_requests_by_code system.file_nr_used
app.mem_private_usage ipv6.bcast k8s_kubelet.rest_client_requests_by_method system.file_nr_utilization
app.mem_usage ipv6.ect k8s_kubelet.volume_manager_total_volumes system.forks
app.processes ipv6.errors k8s_kubeproxy.http_request_duration system.idlejitter
app.swap_usage ipv6.fragsin k8s_kubeproxy.kubeproxy_sync_proxy_rules system.interrupts
app.threads ipv6.fragsout k8s_kubeproxy.kubeproxy_sync_proxy_rules_latency system.intr
app.uptime ipv6.groupmemb k8s_kubeproxy.kubeproxy_sync_proxy_rules_latency_microseconds system.io
app.vmem_usage ipv6.icmp k8s_kubeproxy.rest_client_requests_by_code system.ip
disk.avgsz ipv6.icmpechos k8s_kubeproxy.rest_client_requests_by_method system.ipc_semaphore_arrays
disk.await ipv6.icmperrors mem.available system.ipc_semaphores
disk.backlog ipv6.icmpmldv2 mem.balloon system.ipv6
disk.busy ipv6.icmpneighbor mem.cma system.load
disk.inodes ipv6.icmpredir mem.committed system.net
disk.io ipv6.icmprouter mem.directmaps system.pgpgio
disk.iotime ipv6.icmptypes mem.fragmentation_index_dma system.processes
disk.mops ipv6.mcast mem.fragmentation_index_dma32 system.processes_state
disk.ops ipv6.mcastpkts mem.fragmentation_index_normal system.ram
disk.qops ipv6.packets mem.kernel system.shared_memory_bytes
disk.space ipv6.sockstat6_frag_sockets mem.ksm_cow system.shared_memory_segments
disk.svctm ipv6.sockstat6_raw_sockets mem.oom_kill system.softirqs
disk.util ipv6.sockstat6_tcp_sockets mem.pgfaults system.softnet_stat
disk_ext.avgsz ipv6.sockstat6_udp_sockets mem.reclaiming system.uptime
disk_ext.await ipv6.sockstat6_udplite_sockets mem.slab systemd.service.memory.failcnt
disk_ext.io ipv6.udperrors mem.swap systemd.service.memory.paging.faults
disk_ext.iotime ipv6.udpliteerrors mem.swap_cached systemd.service.memory.paging.io
disk_ext.mops ipv6.udplitepackets mem.swapio systemd.service.memory.ram.usage
disk_ext.ops ipv6.udppackets mem.thp systemd.service.memory.usage
ip.sockstat_sockets ipvs.net mem.thp_collapse systemd.service.memory.writeback
ip.tcp_accept_queue ipvs.packets mem.thp_compact systemd.service.pids.current
ip.tcp_syn_queue ipvs.sockets mem.thp_details user.cpu_context_switches
ip.tcpconnaborts k8s.cgroup.cpu mem.thp_faults user.cpu_utilization
ip.tcperrors k8s.cgroup.cpu_limit mem.thp_file user.disk_logical_io
ip.tcphandshake k8s.cgroup.mem mem.thp_split user.disk_physical_io
ip.tcpmemorypressures k8s.cgroup.mem_activity mem.thp_swapout user.fds_open
ip.tcpofo k8s.cgroup.mem_failcnt mem.thp_zero user.fds_open_limit
ip.tcpopens k8s.cgroup.mem_usage mem.writeback user.mem_page_faults
ip.tcppackets k8s.cgroup.mem_usage_limit mem.zswapio user.mem_private_usage
ip.tcpreorders k8s.cgroup.mem_utilization net.carrier user.mem_usage
ip.tcpsock k8s.cgroup.net_carrier net.drops user.processes
ip.tcpsyncookies k8s.cgroup.net_drops net.errors user.swap_usage
ipv4.bcast k8s.cgroup.net_errors net.events user.threads
ipv4.bcastpkts k8s.cgroup.net_events net.fifo user.uptime
ipv4.ecnpkts k8s.cgroup.net_fifo net.mtu user.vmem_usage
ipv4.errors k8s.cgroup.net_mtu net.net usergroup.cpu_context_switches
ipv4.fragsin k8s.cgroup.net_net net.operstate usergroup.cpu_utilization
ipv4.fragsout k8s.cgroup.net_operstate net.packets usergroup.disk_logical_io
ipv4.icmp k8s.cgroup.net_packets netfilter.conntrack_sockets usergroup.disk_physical_io
ipv4.icmp_errors k8s.cgroup.pgfaults netfilter.synproxy_conn_reopened usergroup.fds_open
ipv4.icmpmsg k8s.cgroup.pids_current netfilter.synproxy_cookies usergroup.fds_open_limit
ipv4.mcast k8s.cgroup.writeback netfilter.synproxy_syn_received usergroup.mem_page_faults
ipv4.mcastpkts k8s_kubelet.apiserver_audit_requests_rejected sctp.chunks usergroup.mem_private_usage
ipv4.packets k8s_kubelet.apiserver_storage_data_key_generation_failures sctp.established usergroup.mem_usage
ipv4.sockstat_frag_mem k8s_kubelet.apiserver_storage_data_key_generation_latencies sctp.fragmentation usergroup.processes
ipv4.sockstat_frag_sockets k8s_kubelet.apiserver_storage_data_key_generation_latencies_percent sctp.packet_errors usergroup.swap_usage
ipv4.sockstat_raw_sockets k8s_kubelet.apiserver_storage_envelope_transformation_cache_misses sctp.packets usergroup.threads
ipv4.sockstat_tcp_mem k8s_kubelet.kubelet_containers_running sctp.transitions usergroup.uptime
ipv4.sockstat_tcp_sockets k8s_kubelet.kubelet_node_config_error system.active_processes usergroup.vmem_usage

Additional reading

[1] Netdata site
[2] Netdata Queries/Lookup
[3] Netdata API
