References:

- kubernetes.io > Concepts > Cluster Administration > Logging Architecture
- kubernetes.io > Tasks > Monitoring, Logging, and Debugging > Logging Using Elasticsearch and Kibana
- kubernetes.io > Tasks > Monitoring, Logging, and Debugging > Troubleshooting
- kubernetes.io > Tasks > Monitoring, Logging, and Debugging > Troubleshoot Applications
- kubernetes.io > Tasks > Monitoring, Logging, and Debugging > Troubleshoot Clusters
- kubernetes.io > Tasks > Monitoring, Logging, and Debugging > Debug Pods and ReplicationControllers
- kubernetes.io > Tasks > Monitoring, Logging, and Debugging > Debug Services
- Shell into the failing Pod/container
- Deploy a similar Pod/container with busybox
- DNS: `dig`, `tcpdump`
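A sketch of the busybox approach above for DNS debugging; the pod name, image tag, and interface name are assumptions, not from the source:

```shell
# Launch a throwaway busybox pod for network debugging (name is arbitrary)
kubectl run debug --image=busybox:1.36 --rm -it --restart=Never -- sh

# Inside the pod: check DNS resolution of a Service (assumes the default cluster domain)
nslookup kubernetes.default.svc.cluster.local

# On a node: capture DNS traffic on port 53 (interface name is an assumption)
tcpdump -i eth0 port 53 -n
```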
- Prometheus for monitoring.
- Grafana for visualization of the metrics collected by Prometheus.
- Fluentd for logging; it feeds aggregated logs to Elasticsearch.
- ELK stack: Elasticsearch, Logstash and Kibana. Elasticsearch receives the aggregated logs from Fluentd, and Kibana is used to visualize them.
- OpenTracing propagates transactions among all services, code and packages.
- Jaeger is a tracing system focused on distributed context propagation, transaction monitoring and root cause analysis. It is an implementation of OpenTracing.
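The Fluentd-to-Elasticsearch hand-off above can be sketched as a minimal Fluentd config; the log path, tag, and Elasticsearch hostname are assumptions for illustration:

```
# Tail container log files (path is an assumption for a typical node)
<source>
  @type tail
  path /var/log/containers/*.log
  pos_file /var/log/fluentd-containers.log.pos
  tag kubernetes.*
  <parse>
    @type json
  </parse>
</source>

# Ship the aggregated logs to Elasticsearch (hostname is an assumption)
<match kubernetes.**>
  @type elasticsearch
  host elasticsearch-logging
  port 9200
  logstash_format true
</match>
```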
Assuming a problematic pod:

```shell
kubectl create deployment problem --image=nginx
```

In the following flow, `<tab>` means pressing the tab key, and it's used to autocomplete the pod name.
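The `<tab>` autocompletion assumes kubectl shell completion is enabled; for bash that is:

```shell
# Enable kubectl autocompletion for the current bash session
source <(kubectl completion bash)

# Persist it for future sessions
echo 'source <(kubectl completion bash)' >> ~/.bashrc
```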
- Investigate errors from the command line:

  ```shell
  kubectl exec -it problem-<tab> -- /bin/bash
  ```

- If the pod is running, check the logs:

  ```shell
  kubectl logs problem-<tab>
  ```

  Consider deploying a *sidecar* container in the pod to generate and handle logging. These can be configured to stream logs or run a logging agent.
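A minimal sketch of the streaming-sidecar pattern: one container writes a log file to a shared volume and a sidecar tails it to its own stdout. Container names, image tags and the log path are assumptions:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app-with-logging-sidecar
spec:
  containers:
  - name: app
    image: busybox:1.36
    # Hypothetical app writing its log to a shared volume
    command: ['sh', '-c', 'while true; do echo "$(date) app log" >> /var/log/app.log; sleep 5; done']
    volumeMounts:
    - name: logs
      mountPath: /var/log
  - name: log-streamer
    image: busybox:1.36
    # Sidecar: streams the log file to its own stdout
    command: ['sh', '-c', 'tail -n+1 -F /var/log/app.log']
    volumeMounts:
    - name: logs
      mountPath: /var/log
  volumes:
  - name: logs
    emptyDir: {}
```

The sidecar's output is then readable with `kubectl logs app-with-logging-sidecar -c log-streamer`.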
- Check networking, including DNS, firewalls and general connectivity, using Linux commands/tools such as `dig`.
- Check RBAC, SELinux and AppArmor security settings. These may cause problems with networking.
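RBAC problems can often be confirmed from the CLI before disabling anything, using the built-in `kubectl auth can-i` check (the namespace and service account name below are assumptions):

```shell
# Can the current user list pods?
kubectl auth can-i list pods

# Check on behalf of a specific service account
kubectl auth can-i get pods --as=system:serviceaccount:default:default
```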
- Check node logs for errors. Make sure nodes have enough resources allocated.
- API calls to and from controllers to `kube-apiserver`.
- Inter-node network issues, DNS & firewall.
- Master server controllers:
  - Control Pods state
  - Errors in log files
  - Sufficient resources
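For the node checks above, a few commands that surface resource pressure and recent errors (`kubectl top` assumes metrics-server is installed in the cluster):

```shell
# Allocated resources, conditions and recent events for one node
kubectl describe node <node-name>

# Current CPU/memory usage per node (needs metrics-server)
kubectl top nodes

# Node status and conditions such as MemoryPressure or DiskPressure at a glance
kubectl get nodes -o wide
```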
- From the basic steps, execute steps #1 to #3.
- Is the containerized application working as expected? Confirm the app is working correctly; check whether this is an intermittent issue or one related to slow performance.
- (The app is not the culprit.) Make sure the Pods are in *Running* status:

  ```shell
  kubectl get pods
  ```

  The status `Pending` usually means a resource is not available from the cluster. Examples: a properly tainted node, the expected storage, or enough resources.
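To see which of those resources is blocking a `Pending` pod, the scheduler's reason appears in the pod's events, and taints can be listed per node:

```shell
# Events at the bottom explain why scheduling failed (e.g. Insufficient cpu, untolerated taint)
kubectl describe pod problem-<tab>

# List taints on every node, one line per node
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.taints}{"\n"}{end}'
```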
- Look at the logs and events of the container:

  ```shell
  kubectl logs problem-<tab>
  kubectl describe pod problem-<tab>
  kubectl get events
  ```

  Check the number of restarts. If the restarts were not caused by the container's command finishing, they may indicate that the application is having issues and failing.
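The restart count is visible in the RESTARTS column of `kubectl get pods`; it can also be pulled directly from the pod status:

```shell
# Restart count of the first container in the pod
kubectl get pod problem-<tab> -o jsonpath='{.status.containerStatuses[0].restartCount}'

# Reason for the last termination (e.g. OOMKilled, Error)
kubectl get pod problem-<tab> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
```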
- If there is no info in the events, check the container logs:

  ```shell
  kubectl logs problem-<tab> -c <container_name>
  ```

- Disable security for testing: disable RBAC, SELinux and AppArmor to identify the root cause of the issue.
- Check system and agent logs.
  - With systemd: logs go to journald; view them with `journalctl -a`, and possibly in `/var/log/journal/`.
  - Without systemd: logs are created in `/var/log/<agent>.log`.

  In both cases the logs may be rotated; if they are not, it's advisable to configure rotation.
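Beyond `journalctl -a`, some filters are usually more practical for agent logs (unit names assume a standard systemd install of the kubelet):

```shell
# Only the kubelet's logs, following new entries
journalctl -u kubelet -f

# Kubelet logs since the last boot, errors and worse only
journalctl -u kubelet -b -p err
```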
Container components:

- `kube-scheduler`
- `kube-proxy`

Non-container components:

- `kubelet`
- Docker
- Others ...
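To inspect both kinds of components: containerized control-plane pieces usually run as pods in the `kube-system` namespace, while non-container ones run as systemd services on each node:

```shell
# Container components (scheduler, proxy, etc.)
kubectl get pods -n kube-system

# Non-container components on a node
systemctl status kubelet
systemctl status docker
```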
A CNCF program to certify distributions that meet essential requirements and adhere to complete API functionality.
Read more about it on GitHub at cncf/k8s-conformance, along with the instructions.