From 9028af05ea712e78821302a483da04b14ab9a60e Mon Sep 17 00:00:00 2001
From: "Craig.Walton"
Date: Thu, 19 Dec 2024 16:48:45 +0000
Subject: [PATCH] Add troubleshooting guide to docs site based on user
 questions.

---
 docs/docs/tips/debugging-k8s-sandboxes.md | 28 +++++++++++--
 docs/docs/tips/troubleshooting.md         | 68 +++++++++++++++++++++++++++++++
 docs/mkdocs.yaml                          |  1 +
 3 files changed, 92 insertions(+), 5 deletions(-)
 create mode 100644 docs/docs/tips/troubleshooting.md

diff --git a/docs/docs/tips/debugging-k8s-sandboxes.md b/docs/docs/tips/debugging-k8s-sandboxes.md
index 17fcf53..460e16b 100644
--- a/docs/docs/tips/debugging-k8s-sandboxes.md
+++ b/docs/docs/tips/debugging-k8s-sandboxes.md
@@ -4,12 +4,17 @@ This section explains features of [Inspect](https://inspect.ai-safety-institute.
 and [k9s](https://k9scli.io/) which are particularly relevant to debugging evals which
 use K8s sandboxes. Please see the dedicated docs pages of each for more information.
 
-## Inspect Log Levels
+## Capture Inspect `SANDBOX`-level logs { #sandbox-log-level }
 
-Using `--log-level sandbox` (or setting the `INSPECT_LOG_LEVEL` env var, or passing the
-`log_level` argument to `eval()`) when running an Inspect eval will give you good
-visibility into the Helm charts being installed, the commands being run within the
-containers, and the output of those commands.
+Useful sandbox-related messages, such as Helm installs/uninstalls and pod operations
+(`exec()` calls including their results, `read_file()`, `write_file()`), are logged at
+the `SANDBOX` log level.
+
+Set Inspect's log level to `SANDBOX` or lower via one of these methods:
+
+ * passing `--log-level sandbox` on the command line
+ * setting the `INSPECT_LOG_LEVEL=sandbox` environment variable
+ * passing the `log_level` argument to `eval()` or `eval_set()`
 
 Example:
 
@@ -35,6 +40,19 @@ SANDBOX - K8S: Completed: Execute command in pod. {
 }
 ```
 
+Additionally, ensure the output of the `logging` module is written to a file on disk:
+
+```sh
+mkdir -p logs
+export INSPECT_PY_LOGGER_FILE="logs/inspect_py_log.log"
+```
+
+The entries in this file include timestamps and are invaluable when piecing together
+an ordered sequence of events.
+
+Consider including the datetime in the log file name (e.g.
+`logs/inspect_py_$(date +%Y%m%d-%H%M%S).log`) to keep logs from different runs separate.
+
 ## Disabling Inspect Cleanup
 
 By default, Inspect will clean up sandboxes (i.e. uninstall Helm releases) after an eval
diff --git a/docs/docs/tips/troubleshooting.md b/docs/docs/tips/troubleshooting.md
new file mode 100644
index 0000000..cf96562
--- /dev/null
+++ b/docs/docs/tips/troubleshooting.md
@@ -0,0 +1,68 @@
+# Troubleshooting
+
+For general K8s and Inspect sandbox debugging, see the [Debugging K8s
+Sandboxes](debugging-k8s-sandboxes.md) guide.
+
+## Capture Inspect `SANDBOX`-level logs
+
+A good starting point for most issues is to capture the output of the Python `logging`
+module at `SANDBOX` level. See the [`SANDBOX` log level
+section](debugging-k8s-sandboxes.md#sandbox-log-level).
+
+## View cluster events
+
+Certain cluster events, such as a node failure, may impact your eval.
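+
+If you suspect a node issue, a quick first check is the health of the nodes (a
+`NotReady` status, for example, means pods scheduled on that node will be disrupted):
+
+```sh
+kubectl get nodes
+```
+
+To list recent events across the cluster, sorted by creation time: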
+
+```sh
+kubectl get events --sort-by='.metadata.creationTimestamp'
+```
+
+To include full timestamps in the output:
+
+```sh
+kubectl get events --sort-by='.metadata.creationTimestamp' \
+  -o custom-columns=LastSeen:.lastTimestamp,Type:.type,Object:.involvedObject.name,Reason:.reason,Message:.message
+```
+
+To filter to a particular release or pod, either pipe into `grep` or use the
+`--field-selector` flag:
+
+```sh
+kubectl get events --sort-by='.metadata.creationTimestamp' \
+  --field-selector involvedObject.name=agent-env-xxxxxxxx-default-0
+```
+
+Find the Pod name (including the random 8-character identifier) in the `SANDBOX`-level
+logs or the stack trace.
+
+To query a specific namespace, use the `-n` flag.
+
+## I'm seeing "Helm uninstall failed" errors
+
+These errors usually mean that the Helm chart was never installed in the first place.
+This typically happens if you cancel an eval, or if an eval fails before a given
+sample's Helm chart was installed.
+
+Check whether any Helm releases were left behind:
+
+```sh
+helm list
+```
+
+## I'm seeing "Handshake status 404 Not Found" errors from Pod operations
+
+This typically indicates that the Pod has been killed. This may be due to cluster
+issues (see [View cluster events](#view-cluster-events) above), or because the eval
+had already failed and the Helm releases were uninstalled whilst some operations
+were queued or in flight.
+
+Check the `.json` or `.eval` log produced by Inspect to see the underlying error.
diff --git a/docs/mkdocs.yaml b/docs/mkdocs.yaml
index ebcd76e..b48353d 100644
--- a/docs/mkdocs.yaml
+++ b/docs/mkdocs.yaml
@@ -43,6 +43,7 @@ nav:
   - "Examples": examples.md
   - "Tips":
     - "Debugging K8s Sandboxes": tips/debugging-k8s-sandboxes.md
+    - "Troubleshooting": tips/troubleshooting.md
     - "Docker Images": tips/images.md
     - "Hubble (Cilium UI)": tips/hubble.md
     - "Concurrency": tips/concurrency.md