From 9028af05ea712e78821302a483da04b14ab9a60e Mon Sep 17 00:00:00 2001
From: "Craig.Walton"
Date: Thu, 19 Dec 2024 16:48:45 +0000
Subject: [PATCH] Add troubleshooting guide to docs site based on user
 questions.

---
 docs/docs/tips/debugging-k8s-sandboxes.md | 28 +++++++++++--
 docs/docs/tips/troubleshooting.md         | 68 +++++++++++++++++++++++++++++++
 docs/mkdocs.yaml                          |  1 +
 3 files changed, 92 insertions(+), 5 deletions(-)
 create mode 100644 docs/docs/tips/troubleshooting.md

diff --git a/docs/docs/tips/debugging-k8s-sandboxes.md b/docs/docs/tips/debugging-k8s-sandboxes.md
index 17fcf53..460e16b 100644
--- a/docs/docs/tips/debugging-k8s-sandboxes.md
+++ b/docs/docs/tips/debugging-k8s-sandboxes.md
@@ -4,12 +4,17 @@ This section explains features of [Inspect](https://inspect.ai-safety-institute.
 and [k9s](https://k9scli.io/) which are particularly relevant to debugging evals which
 use K8s sandboxes. Please see the dedicated docs pages of each for more information.
 
-## Inspect Log Levels
+## Capture Inspect `SANDBOX`-level logs { #sandbox-log-level }
 
-Using `--log-level sandbox` (or setting the `INSPECT_LOG_LEVEL` env var, or passing the
-`log_level` argument to `eval()`) when running an Inspect eval will give you good
-visibility into the Helm charts being installed, the commands being run within the
-containers, and the output of those commands.
+Useful sandbox-related messages, such as Helm installs/uninstalls and pod operations
+(`exec()` calls including their results, `read_file()`, `write_file()`), are logged at
+the `SANDBOX` log level.
+
+Set Inspect's log level to `SANDBOX` or lower via one of these methods:
+
+ * passing `--log-level sandbox` on the command line
+ * setting the `INSPECT_LOG_LEVEL=sandbox` environment variable
+ * passing the `log_level` argument to `eval()` or `eval_set()`
 
 Example:
 
@@ -35,6 +40,19 @@ SANDBOX - K8S: Completed: Execute command in pod. {
 }
 ```
 
+Additionally, ensure the output of the `logging` module is written to a file on disk:
+
+```sh
+mkdir -p logs
+export INSPECT_PY_LOGGER_FILE="logs/inspect_py_log.log"
+```
+
+The entries in this file include timestamps and are invaluable when piecing together
+an ordered sequence of events.
+
+Consider including the datetime in the log file name (e.g.
+`logs/inspect_py_$(date +%Y%m%d-%H%M%S).log`) to keep logs from different runs separate.
+
 ## Disabling Inspect Cleanup
 
 By default, Inspect will clean up sandboxes (i.e. uninstall Helm releases) after an eval
diff --git a/docs/docs/tips/troubleshooting.md b/docs/docs/tips/troubleshooting.md
new file mode 100644
index 0000000..cf96562
--- /dev/null
+++ b/docs/docs/tips/troubleshooting.md
@@ -0,0 +1,68 @@
+# Troubleshooting
+
+For general K8s and Inspect sandbox debugging, see the [Debugging K8s
+Sandboxes](debugging-k8s-sandboxes.md) guide.
+
+## Capture Inspect `SANDBOX`-level logs
+
+A good starting point for most issues is to capture the output of the Python `logging`
+module at `SANDBOX` level. See the [`SANDBOX` log level
+section](debugging-k8s-sandboxes.md#sandbox-log-level).
+
+## View cluster events
+
+Certain cluster events, such as a node failure, may impact your eval.
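+
+If you suspect a node issue, a quick first check is the health of the nodes (a
+`NotReady` status, for example, means pods scheduled on that node will be disrupted):
+
+```sh
+kubectl get nodes
+```
+
+To list recent events across the cluster, sorted by creation time: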
+
+```sh
+kubectl get events --sort-by='.metadata.creationTimestamp'
+```
+
+To include full timestamps in the output:
+
+```sh
+kubectl get events --sort-by='.metadata.creationTimestamp' \
+  -o custom-columns=LastSeen:.lastTimestamp,Type:.type,Object:.involvedObject.name,Reason:.reason,Message:.message
+```
+
+To filter to a particular release or pod, either pipe into `grep` or use the
+`--field-selector` flag:
+
+```sh
+kubectl get events --sort-by='.metadata.creationTimestamp' \
+  --field-selector involvedObject.name=agent-env-xxxxxxxx-default-0
+```
+
+Find the Pod name (including the random 8-character identifier) in the `SANDBOX`-level
+logs or the stack trace.
+
+To query a specific namespace, use the `-n` flag.
+
+## I'm seeing "Helm uninstall failed" errors
+
+These errors usually mean that the Helm chart was never installed in the first place.
+This typically happens if you cancel an eval, or if an eval fails before a given
+sample's Helm chart was installed.
+
+Check whether any Helm releases were left behind:
+
+```sh
+helm list
+```
+
+## I'm seeing "Handshake status 404 Not Found" errors from Pod operations
+
+This typically indicates that the Pod has been killed. This may be due to cluster
+issues (see [View cluster events](#view-cluster-events) above), or because the eval
+had already failed and the Helm releases were uninstalled whilst some operations
+were queued or in flight.
+
+Check the `.json` or `.eval` log produced by Inspect to see the underlying error.
diff --git a/docs/mkdocs.yaml b/docs/mkdocs.yaml
index ebcd76e..b48353d 100644
--- a/docs/mkdocs.yaml
+++ b/docs/mkdocs.yaml
@@ -43,6 +43,7 @@ nav:
   - "Examples": examples.md
   - "Tips":
     - "Debugging K8s Sandboxes": tips/debugging-k8s-sandboxes.md
+    - "Troubleshooting": tips/troubleshooting.md
     - "Docker Images": tips/images.md
     - "Hubble (Cilium UI)": tips/hubble.md
     - "Concurrency": tips/concurrency.md