Commit: troubleshoot notification service by multiple agents
Showing 6 changed files with 164 additions and 23 deletions.
@@ -0,0 +1,4 @@
+August 31, 2023 1:27:12 node-01 kubelet[12345]: I0628 1:27:12.456789 node-01 kubelet[12345]: OOMKilling POD uid: "b1234c56-d78e-11e9-8a28-0242ac110002", name: "node-ui-service-pod", namespace: "my-namespace", container: "node-ui-service-container", Memory cgroup out of memory: Kill process 5678 (node-ui-service-container) score 1000 or sacrifice child
+August 31, 2023 1:27:12 node-01 kernel: [345678.123456] Memory cgroup out of memory: Kill process 5678 (node-ui-service-container) score 1000 or sacrifice child
+August 31, 2023 1:27:12 node-01 kernel: [345678.123457] Killed process 5678 (node-ui-service-container) total-vm:123456kB, anon-rss:12345kB, file-rss:0kB, shmem-rss:0kB
+August 31, 2023 1:27:12 node-01 kubelet[12345]: I0628 1:27:12.456799 node-01 kubelet[12345]: pod "node-ui-service-pod_my-namespace(b1234c56-d78e-11e9-8a28-0242ac110002)" failed due to OOM Killer.
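These four sample lines are the entire new file. As a rough illustration of how an agent step (or a quick script) could confirm an OOM kill from logs in this format, here is a minimal Python sketch; the regex is keyed to the sample lines above, and find_oom_kills is a hypothetical helper written for this page, not a function from this repository.

import re

# Matches the kernel/kubelet OOM-killer lines above, e.g.
# "Kill process 5678 (node-ui-service-container) score 1000 or sacrifice child"
OOM_RE = re.compile(r"Kill process (?P<pid>\d+) \((?P<proc>[^)]+)\) score (?P<score>\d+)")

def find_oom_kills(log_lines):
    # Return (pid, process_name, oom_score) for every OOM kill seen.
    return [
        (int(m["pid"]), m["proc"], int(m["score"]))
        for m in (OOM_RE.search(line) for line in log_lines)
        if m
    ]

Run against the sample lines above, this returns [(5678, 'node-ui-service-container', 1000)] twice, once per OOM-killer line.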
This file was deleted.
genia/tools_config/skills/troubleshoot_notification_service.txt (15 changes: 10 additions & 5 deletions)

@@ -1,7 +1,12 @@
 To troubleshoot a service in production, follow these steps:

-1. call function 'fetch_grafana_observability_metric_data' serially to fetch the Grafana service metrics data for the problematic service name, once for each of the following metric names: cpu, memory, cluster_size and k8s_crash_loopbacks
-2. comparing against the entire 30-minute data set, carefully look for a sudden increase in each metric over the past few minutes as an anomaly for that metric's data; take it step by step and validate that you have the right answer
-3. the service consumes from a kafka queue; call function 'fetch_grafana_observability_metric_data' with its key metric 'kafka_lag_size', which is the number of input messages waiting in the queue to be handled by the service
-4. call function 'fetch_k8s_service_log_data' with the problematic service name and log name 'notification-service-k8s-logs' and look for k8s errors that might have caused the issue
-5. summarize your findings in an actionable way, with recommendations on what to do next
+1. call function 'fetch_grafana_observability_metric_data' serially to fetch the Grafana service metrics data for the problematic service name, once for each of the following metric names: cpu, memory, cluster_size and k8s_crash_loopbacks. Look at each data set separately and detect anomalies by comparing the last 5 minutes to the mean and standard deviation of the preceding 25 minutes; flag a z-score greater than 3 or less than -3 for that metric, and validate that you have the right answer (see the sketch after this diff)
+2. the service consumes from a kafka queue; call function 'fetch_grafana_observability_metric_data' with its key metric 'kafka_lag_size', which is the number of input messages waiting in the queue to be handled by the service
+3. call function 'fetch_k8s_service_log_data' with the problematic service name and log name 'node-ui-service-k8s-logs' and look for k8s errors that might have caused the issue
+4. print your findings in 3 sections:
+Report:
+for each data set collected, print the name and a short description of the findings
+Insights summary:
+a short summary and insights from the findings
+Recommendations:
+suggest to the user what should be done next
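As referenced in step 1 of the added instructions, here is a minimal sketch of the z-score anomaly check in Python. It assumes fetch_grafana_observability_metric_data returns an ordered list of per-minute float samples covering the 30-minute window; that return shape is an assumption made for illustration, not the tool's documented contract.

import statistics

def is_anomalous(samples, recent=5):
    # Compare the mean of the last `recent` samples against the mean and
    # standard deviation of the preceding baseline window (step 1 above).
    baseline, tail = samples[:-recent], samples[-recent:]
    mu = statistics.mean(baseline)
    sigma = statistics.stdev(baseline)
    if sigma == 0:
        return False  # flat baseline; treating any change as anomalous is a policy choice
    z = (statistics.mean(tail) - mu) / sigma
    return abs(z) > 3  # flag z-scores above 3 or below -3

# Assumed usage, one metric at a time as the skill instructs:
# for metric in ('cpu', 'memory', 'cluster_size', 'k8s_crash_loopbacks'):
#     data = fetch_grafana_observability_metric_data(service_name, metric)
#     print(metric, 'anomalous' if is_anomalous(data) else 'normal')

Note that statistics.stdev raises StatisticsError only when the baseline has fewer than two samples, so a 30-minute window at one sample per minute is safe, and a perfectly flat baseline yields sigma of exactly 0, which the guard above handles explicitly.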