
k8s log
shlomsh committed Aug 31, 2023
1 parent c6a5cc8 commit f8667e3
Showing 3 changed files with 6 additions and 2 deletions.
data/demo/logs/notification-service-k8s-logs.txt (4 additions, 0 deletions)
@@ -0,0 +1,4 @@
+August 31, 2023 1:27:12 node-01 kubelet[12345]: I0628 1:27:12.456789 node-01 kubelet[12345]: OOMKilling POD uid: "b1234c56-d78e-11e9-8a28-0242ac110002", name: "notification-service-pod", namespace: "my-namespace", container: "notification-service-container", Memory cgroup out of memory: Kill process 5678 (notification-service-container) score 1000 or sacrifice child
+August 31, 2023 1:27:12 node-01 kernel: [345678.123456] Memory cgroup out of memory: Kill process 5678 (notification-service-container) score 1000 or sacrifice child
+August 31, 2023 1:27:12 node-01 kernel: [345678.123457] Killed process 5678 (notification-service-container) total-vm:123456kB, anon-rss:12345kB, file-rss:0kB, shmem-rss:0kB
+August 31, 2023 1:27:12 node-01 kubelet[12345]: I0628 1:27:12.456799 node-01 kubelet[12345]: pod "notification-service-pod_my-namespace(b1234c56-d78e-11e9-8a28-0242ac110002)" failed due to OOM Killer.
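
These demo log lines record an OOM kill of the notification-service pod. As context only, here is a minimal sketch of how such lines could be scanned for OOM-kill events; the regex and helper below are illustrative assumptions and are not part of this commit.

    import re

    # Hypothetical parser for the demo log above; field names follow the
    # OOMKilling kubelet line as written in this file.
    OOM_PATTERN = re.compile(
        r'OOMKilling POD uid: "(?P<uid>[^"]+)", name: "(?P<pod>[^"]+)", '
        r'namespace: "(?P<namespace>[^"]+)", container: "(?P<container>[^"]+)"'
    )

    def find_oom_events(log_path):
        """Return (pod, namespace, container) for every OOMKilling entry."""
        events = []
        with open(log_path) as f:
            for line in f:
                match = OOM_PATTERN.search(line)
                if match:
                    events.append((match["pod"], match["namespace"], match["container"]))
        return events

    # e.g. find_oom_events("data/demo/logs/notification-service-k8s-logs.txt")
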
genia/settings/settings.toml (1 addition, 1 deletion)
@@ -6,7 +6,7 @@ timeout=45
temperature=0

[chat]
-max_user_message_chain=20
+max_user_message_chain=10
max_chat_tokens=4000
# note the limit is in length, not number of tokens
max_chat_function_len=4000
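
For reference, a minimal sketch of reading the [chat] section above with the Python standard library; how the genia project actually loads settings.toml (for example via a dedicated settings library) is not shown in this commit, so the loading code below is an assumption.

    import tomllib  # Python 3.11+

    # Hypothetical reader for the [chat] settings shown above; the project
    # may load settings.toml differently in practice.
    with open("genia/settings/settings.toml", "rb") as f:
        settings = tomllib.load(f)

    chat = settings["chat"]
    print(chat["max_user_message_chain"])  # 10 after this commit
    print(chat["max_chat_tokens"])         # 4000
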
(third changed file: 1 addition, 1 deletion; file name not shown in this view)
@@ -1,6 +1,6 @@
To troubleshoot a service in production, follow these steps:

-1. call function 'fetch_grafana_observability_metric_data' serially to fetch all the Grafana service metrics data with the problematic service name and each time with one of the following metric name: cpu, memory, iops, cluster_size and k8s_crash_loopbacks
+1. call function 'fetch_grafana_observability_metric_data' serially to fetch all the Grafana service metrics data with the problematic service name and each time with one of the following metric name: cpu, memory, cluster_size and k8s_crash_loopbacks
2. comparing to the entire data set of 30 minutes, carefully look for a sudden increase in each of the metrics data in past few minutes as an anomaly for this metrics data, take it step by step and validate you have the right answer
3. the service consumes from a kafka queue, call function 'fetch_grafana_observability_metric_data' using its key metric 'kafka_lag_size' which means the number of input messages waiting in the queue to be handled by the service
4. the service is a kubernetes (k8s) based service, call function 'fetch_k8s_service_log_data' with log name 'notification-service-k8s-logs'
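
To make the flow above concrete, here is a minimal sketch of driving steps 1, 3 and 4 serially from client code. Only the function names, metric names and log name come from the prompt; the signatures, keyword arguments and return shapes are assumptions for illustration, and step 2 (anomaly detection against the 30-minute baseline) is left to the model.

    # Hypothetical driver for the troubleshooting steps above; the real
    # fetch_* functions are exposed to the model as tools, and their
    # signatures and return values here are assumptions.
    METRICS = ["cpu", "memory", "cluster_size", "k8s_crash_loopbacks"]

    def troubleshoot(service_name, fetch_metric, fetch_k8s_log):
        # Step 1: fetch each Grafana metric for the service, one call per metric.
        metric_data = {
            name: fetch_metric(service_name=service_name, metric_name=name)
            for name in METRICS
        }
        # Step 3: check the Kafka consumer lag on the service's input queue.
        metric_data["kafka_lag_size"] = fetch_metric(
            service_name=service_name, metric_name="kafka_lag_size"
        )
        # Step 4: pull the k8s log for the service.
        k8s_log = fetch_k8s_log(log_name="notification-service-k8s-logs")
        # Step 2 (spotting sudden increases vs. the 30-minute baseline) is
        # performed by the model over metric_data.
        return metric_data, k8s_log
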
