---
title: Observability
reviewers:
weight: 55
content_type: concept
description: >
  Understand how to gain end-to-end visibility of a Kubernetes cluster through the collection of metrics, logs, and traces.
no_list: true
card:
  name: setup
  weight: 60
  anchors:
  - anchor: "#metrics"
    title: Metrics
  - anchor: "#logs"
    title: Logs
  - anchor: "#traces"
    title: Traces
---

<!-- overview -->

In Kubernetes, observability is the process of collecting and analyzing metrics, logs, and traces (often called the three pillars of observability) to better understand the internal state, performance, and health of the cluster.

Kubernetes control plane components, as well as many add-ons, generate and emit these signals. By aggregating and correlating them, you can gain a unified picture of the control plane, add-ons, and applications across the cluster.

Figure 1 outlines how cluster components emit the three primary signal types.

{{< mermaid >}}
flowchart LR
  A[Cluster components] --> M[Metrics pipeline]
  A --> L[Log pipeline]
  A --> T[Trace pipeline]
  M --> S[(Storage and analysis)]
  L --> S
  T --> S
  S --> O[Operators and automation]
{{< /mermaid >}}

*Figure 1. High-level signals emitted by cluster components and their consumers.*

<!-- body -->
## Metrics

Kubernetes components emit metrics in [Prometheus format](https://prometheus.io/docs/instrumenting/exposition_formats/) from their `/metrics` endpoints, including:

- kube-controller-manager
- kube-proxy
- kube-apiserver
- kube-scheduler
- kubelet

The kubelet also exposes metrics at `/metrics/cadvisor`, `/metrics/resource`, and `/metrics/probes`, and add-ons such as [kube-state-metrics](/docs/concepts/cluster-administration/kube-state-metrics/) enrich those control plane signals with Kubernetes object status.
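
Given sufficient RBAC permissions, you can inspect these endpoints directly through the API server. A quick sketch (`<node-name>` is a placeholder for one of your nodes):

```shell
# Metrics exposed by the kube-apiserver itself
kubectl get --raw /metrics | head

# Resource metrics from one node's kubelet, proxied through the API server
kubectl get --raw /api/v1/nodes/<node-name>/proxy/metrics/resource | head
```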

A typical Kubernetes metrics pipeline periodically scrapes these endpoints and stores the samples in a time series database (for example, Prometheus).
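
For example, an in-cluster Prometheus server can discover every node and scrape its kubelet using built-in Kubernetes service discovery. A minimal sketch of such a scrape configuration; the job name is arbitrary, and the credential paths assume Prometheus runs in a pod:

```yaml
# prometheus.yml (fragment): scrape the kubelet on every node
scrape_configs:
  - job_name: kubernetes-nodes      # illustrative name
    scheme: https
    kubernetes_sd_configs:
      - role: node                  # one target per cluster node
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    authorization:
      credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
```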

See the [system metrics guide](/docs/concepts/cluster-administration/system-metrics/) for details and configuration options.

Figure 2 outlines a common Kubernetes metrics pipeline.

{{< mermaid >}}
flowchart LR
  C[Cluster components] --> P[Prometheus scraper]
  P --> TS[(Time series storage)]
  TS --> D[Dashboards and alerts]
  TS --> A[Automated actions]
{{< /mermaid >}}

*Figure 2. Components of a typical Kubernetes metrics pipeline.*

For multi-cluster or multi-cloud visibility, distributed time series databases (for example Thanos or Cortex) can complement Prometheus.

See [Common observability tools - metrics tools](#metrics-tools) for metrics scrapers and time series databases.

#### {{% heading "seealso" %}}

- [System metrics for Kubernetes components](/docs/concepts/cluster-administration/system-metrics/)
- [Resource usage monitoring with metrics-server](/docs/tasks/debug/debug-cluster/resource-usage-monitoring/)
- [kube-state-metrics concept](/docs/concepts/cluster-administration/kube-state-metrics/)
- [Resource metrics pipeline overview](/docs/tasks/debug/debug-cluster/resource-metrics-pipeline/)

## Logs

Logs provide a chronological record of events inside applications, Kubernetes system components, and security-related activities such as audit logging.
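
Audit logging, for instance, is driven by a policy file that tells the API server which events to record and at what level of detail. A minimal sketch that records request metadata for everything (a deliberately coarse rule, chosen for illustration):

```yaml
# audit-policy.yaml: passed to the kube-apiserver via --audit-policy-file
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
  - level: Metadata   # record who did what and when, but not request bodies
```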

Container runtimes capture a containerized application’s output from standard output (`stdout`) and standard error (`stderr`) streams. While runtimes implement this differently, the integration with the kubelet is standardized through the _CRI logging format_, and the kubelet makes these logs available through `kubectl logs`.
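
For example (the Deployment, container, and Pod names below are placeholders):

```shell
# Stream logs from one container of a Deployment, starting one hour back
kubectl logs deployment/my-app -c app --since=1h -f

# Read logs from the previous, crashed instance of a Pod's container
kubectl logs my-app-12345 --previous
```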

*Figure 3. Node-level logging architecture.*

System component logs capture events from the cluster and are often useful for debugging and troubleshooting. These components fall into two groups: those that run in a container and those that do not. For example, the `kube-scheduler` and `kube-proxy` usually run in containers, whereas the `kubelet` and the container runtime run directly on the host.

- On machines with `systemd`, the kubelet and container runtime write to journald; you can read those entries as shown below. Otherwise, they write to `.log` files in the `/var/log` directory.
- System components that run inside containers always write to `.log` files in `/var/log`, bypassing the default container logging mechanism.
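
A quick sketch of reading the kubelet's journald entries on a `systemd`-based node:

```shell
# Follow the kubelet's logs on a systemd-based node
journalctl -u kubelet -f

# Show only entries from the current boot
journalctl -u kubelet -b
```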

System component and container logs stored under `/var/log` require log rotation to prevent uncontrolled growth. Some cluster provisioning scripts install log rotation by default; verify your environment and adjust as needed. See the [system logs reference](/docs/concepts/cluster-administration/system-logs/) for details on locations, formats, and configuration options.
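
For container logs specifically, the kubelet performs the rotation itself; the thresholds are set in the kubelet configuration file. A minimal sketch, with illustrative values:

```yaml
# Kubelet configuration fragment: rotate each container log at 10Mi, keep 5 files
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
containerLogMaxSize: 10Mi
containerLogMaxFiles: 5
```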

Most clusters run a node-level logging agent (for example, Fluent Bit or Fluentd) that tails these files and forwards entries to a central log store. The [logging architecture guidance](/docs/concepts/cluster-administration/logging/) explains how to design such pipelines, apply retention, and route log flows to backends.
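
As an illustration, a node-level Fluent Bit agent that tails the CRI-format container logs and ships them to an Elasticsearch-compatible store might be configured roughly like this (shown in Fluent Bit's YAML configuration format; the host name is an assumption for the sketch):

```yaml
# fluent-bit.yaml: tail container logs on the node and forward them
pipeline:
  inputs:
    - name: tail
      path: /var/log/containers/*.log
      parser: cri                      # parse the CRI logging format
  outputs:
    - name: es
      match: "*"
      host: elasticsearch.logging.svc  # assumed in-cluster log store
      port: 9200
```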

Figure 4 outlines a common log aggregation pipeline.

{{< mermaid >}}
flowchart LR
  subgraph Sources
    A[Application stdout / stderr]
    B[Control plane logs]
    C[Audit records]
  end
  A --> N[Node log agent]
  B --> N
  C --> N
  N --> L[Central log store]
  L --> Q[Dashboards, alerting, SIEM]
{{< /mermaid >}}

*Figure 4. Components of a typical Kubernetes logs pipeline.*

See [Common observability tools - logging tools](#logging-tools) for logging agents and central log stores.

#### {{% heading "seealso" %}}

- [Logging architecture](/docs/concepts/cluster-administration/logging/)
- [System logs](/docs/concepts/cluster-administration/system-logs/)
- [Logging tasks and tutorials](/docs/tasks/debug/logging/)
- [Configure audit logging](/docs/tasks/debug/debug-cluster/audit/)

## Traces

Traces capture how requests move across Kubernetes components and applications, linking latency, timing, and the relationships between operations. By collecting traces, you can visualize end-to-end request flow, diagnose performance issues, and identify bottlenecks or unexpected interactions in the control plane, add-ons, or applications.

Kubernetes {{< skew currentVersion >}} can export spans over the [OpenTelemetry Protocol](/docs/concepts/cluster-administration/system-traces/) (OTLP), either directly via built-in gRPC exporters or by forwarding them through an OpenTelemetry Collector.
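
For example, the kubelet can be told to emit spans to a local OTLP endpoint through its configuration file. A minimal sketch, with an illustrative sampling rate; depending on the Kubernetes version, the `KubeletTracing` feature gate may also need to be enabled:

```yaml
# Kubelet configuration fragment: send spans to a local OTLP gRPC endpoint
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
tracing:
  endpoint: localhost:4317        # OTLP gRPC receiver, e.g. an OpenTelemetry Collector
  samplingRatePerMillion: 10000   # sample roughly 1% of spans
```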

The OpenTelemetry Collector receives spans from components and applications, processes them (for example by applying sampling or redaction), and forwards them to a tracing backend for storage and analysis.
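
A minimal sketch of a Collector pipeline that receives OTLP spans, batches them, and forwards them to a tracing backend (the backend address and plaintext transport are assumptions):

```yaml
# otel-collector-config.yaml: receive OTLP spans and forward them to a backend
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
processors:
  batch: {}                            # group spans before export
exporters:
  otlp:
    endpoint: tempo.tracing.svc:4317   # assumed tracing backend address
    tls:
      insecure: true                   # plaintext inside the cluster (assumption)
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
```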

Figure 5 outlines a typical distributed tracing pipeline.

{{< mermaid >}}
flowchart LR
  subgraph Sources
    A[Control plane spans]
    B[Application spans]
  end
  A --> X[OTLP exporter]
  B --> X
  X --> COL[OpenTelemetry Collector]
  COL --> TS[(Tracing backend)]
  TS --> V[Visualization and analysis]
{{< /mermaid >}}

*Figure 5. Components of a typical Kubernetes traces pipeline.*

See [Common observability tools - tracing tools](#tracing-tools) for tracing collectors and backends.

#### {{% heading "seealso" %}}

- [System traces for Kubernetes components](/docs/concepts/cluster-administration/system-traces/)
- [OpenTelemetry Collector getting started guide](https://opentelemetry.io/docs/collector/getting-started/)
- [Monitoring and tracing tasks](/docs/tasks/debug/monitoring/)

## Common observability tools

{{% thirdparty-content %}}

Note: This section links to third-party projects that provide observability capabilities for Kubernetes.
The Kubernetes project authors aren't responsible for these projects, which are listed alphabetically. To add a
project to this list, read the [content guide](/docs/contribute/style/content-guide/) before submitting a change.

### Metrics tools

- [Cortex](https://cortexmetrics.io/) offers horizontally scalable, long-term Prometheus storage.
- [Grafana Mimir](https://grafana.com/oss/mimir/) is a Grafana Labs project that provides multi-tenant, horizontally scalable Prometheus-compatible storage.
- [Prometheus](https://prometheus.io/) is a monitoring system that scrapes and stores metrics from Kubernetes components.
- [Thanos](https://thanos.io/) extends Prometheus with global querying, downsampling, and object storage support.

### Logging tools

- [Elasticsearch](https://www.elastic.co/elasticsearch/) delivers distributed log indexing and search.
- [Fluent Bit](https://fluentbit.io/) collects and forwards container and node logs with a low resource footprint.
- [Fluentd](https://www.fluentd.org/) routes and transforms logs to multiple destinations.
- [Grafana Loki](https://grafana.com/oss/loki/) stores logs in a Prometheus-inspired, label-based format.
- [OpenSearch](https://opensearch.org/) provides open source log indexing and search compatible with Elasticsearch APIs.

### Tracing tools

- [Grafana Tempo](https://grafana.com/oss/tempo/) offers scalable, low-cost distributed tracing storage.
- [Jaeger](https://www.jaegertracing.io/) captures and visualizes distributed traces for microservices.
- [OpenTelemetry Collector](https://opentelemetry.io/docs/collector/) receives, processes, and exports telemetry data including traces.
- [Zipkin](https://zipkin.io/) provides distributed tracing collection and visualization.

## {{% heading "whatsnext" %}}

- Learn how to [collect resource usage metrics with metrics-server](/docs/tasks/debug/debug-cluster/resource-usage-monitoring/)
- Explore [logging tasks and tutorials](/docs/tasks/debug/logging/)
- Follow the [monitoring and tracing task guides](/docs/tasks/debug/monitoring/)
- Review the [system metrics guide](/docs/concepts/cluster-administration/system-metrics/) for component endpoints and stability
- Review the [common observability tools](#common-observability-tools) section for vetted third-party options