
k8s log
shlomsh committed Aug 31, 2023
1 parent c6a5cc8 commit f8667e3
Showing 3 changed files with 6 additions and 2 deletions.
data/demo/logs/notification-service-k8s-logs.txt (4 additions, 0 deletions)
@@ -0,0 +1,4 @@
+August 31, 2023 1:27:12 node-01 kubelet[12345]: I0628 1:27:12.456789 node-01 kubelet[12345]: OOMKilling POD uid: "b1234c56-d78e-11e9-8a28-0242ac110002", name: "notification-service-pod", namespace: "my-namespace", container: "notification-service-container", Memory cgroup out of memory: Kill process 5678 (notification-service-container) score 1000 or sacrifice child
+August 31, 2023 1:27:12 node-01 kernel: [345678.123456] Memory cgroup out of memory: Kill process 5678 (notification-service-container) score 1000 or sacrifice child
+August 31, 2023 1:27:12 node-01 kernel: [345678.123457] Killed process 5678 (notification-service-container) total-vm:123456kB, anon-rss:12345kB, file-rss:0kB, shmem-rss:0kB
+August 31, 2023 1:27:12 node-01 kubelet[12345]: I0628 1:27:12.456799 node-01 kubelet[12345]: pod "notification-service-pod_my-namespace(b1234c56-d78e-11e9-8a28-0242ac110002)" failed due to OOM Killer.
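
These demo log lines record an OOM kill of the notification-service pod. As context only, here is a minimal sketch of how such lines could be scanned for OOM-kill events; the regex and helper below are illustrative assumptions and are not part of this commit.

    import re

    # Hypothetical parser for the demo log above; field names follow the
    # OOMKilling kubelet line as written in this file.
    OOM_PATTERN = re.compile(
        r'OOMKilling POD uid: "(?P<uid>[^"]+)", name: "(?P<pod>[^"]+)", '
        r'namespace: "(?P<namespace>[^"]+)", container: "(?P<container>[^"]+)"'
    )

    def find_oom_events(log_path):
        """Return (pod, namespace, container) for every OOMKilling entry."""
        events = []
        with open(log_path) as f:
            for line in f:
                match = OOM_PATTERN.search(line)
                if match:
                    events.append((match["pod"], match["namespace"], match["container"]))
        return events

    # e.g. find_oom_events("data/demo/logs/notification-service-k8s-logs.txt")
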
genia/settings/settings.toml (1 addition, 1 deletion)
@@ -6,7 +6,7 @@ timeout=45
temperature=0

[chat]
-max_user_message_chain=20
+max_user_message_chain=10
max_chat_tokens=4000
# note the limit is in length, not number of tokens
max_chat_function_len=4000
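
For reference, a minimal sketch of reading the [chat] section above with the Python standard library; how the genia project actually loads settings.toml (for example via a dedicated settings library) is not shown in this commit, so the loading code below is an assumption.

    import tomllib  # Python 3.11+

    # Hypothetical reader for the [chat] settings shown above; the project
    # may load settings.toml differently in practice.
    with open("genia/settings/settings.toml", "rb") as f:
        settings = tomllib.load(f)

    chat = settings["chat"]
    print(chat["max_user_message_chain"])  # 10 after this commit
    print(chat["max_chat_tokens"])         # 4000
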
(third changed file: 1 addition, 1 deletion; file name not shown in this view)
@@ -1,6 +1,6 @@
To troubleshoot a service in production, follow these steps:

-1. call function 'fetch_grafana_observability_metric_data' serially to fetch all the Grafana service metrics data with the problematic service name and each time with one of the following metric name: cpu, memory, iops, cluster_size and k8s_crash_loopbacks
+1. call function 'fetch_grafana_observability_metric_data' serially to fetch all the Grafana service metrics data with the problematic service name and each time with one of the following metric name: cpu, memory, cluster_size and k8s_crash_loopbacks
2. comparing to the entire data set of 30 minutes, carefully look for a sudden increase in each of the metrics data in past few minutes as an anomaly for this metrics data, take it step by step and validate you have the right answer
3. the service consumes from a kafka queue, call function 'fetch_grafana_observability_metric_data' using its key metric 'kafka_lag_size' which means the number of input messages waiting in the queue to be handled by the service
4. the service is a kubernetes (k8s) based service, call function 'fetch_k8s_service_log_data' with log name 'notification-service-k8s-logs'
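
To make the flow above concrete, here is a minimal sketch of driving steps 1, 3 and 4 serially from client code. Only the function names, metric names and log name come from the prompt; the signatures, keyword arguments and return shapes are assumptions for illustration, and step 2 (anomaly detection against the 30-minute baseline) is left to the model.

    # Hypothetical driver for the troubleshooting steps above; the real
    # fetch_* functions are exposed to the model as tools, and their
    # signatures and return values here are assumptions.
    METRICS = ["cpu", "memory", "cluster_size", "k8s_crash_loopbacks"]

    def troubleshoot(service_name, fetch_metric, fetch_k8s_log):
        # Step 1: fetch each Grafana metric for the service, one call per metric.
        metric_data = {
            name: fetch_metric(service_name=service_name, metric_name=name)
            for name in METRICS
        }
        # Step 3: check the Kafka consumer lag on the service's input queue.
        metric_data["kafka_lag_size"] = fetch_metric(
            service_name=service_name, metric_name="kafka_lag_size"
        )
        # Step 4: pull the k8s log for the service.
        k8s_log = fetch_k8s_log(log_name="notification-service-k8s-logs")
        # Step 2 (spotting sudden increases vs. the 30-minute baseline) is
        # performed by the model over metric_data.
        return metric_data, k8s_log
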
