[docs] Move troubleshooting section to top level of ToC (#8145)
This moves the Troubleshoot ECK section up to the top level of the ECK Guide to make it more prominent, and because the troubleshooting topics may apply more broadly than "operating ECK" alone.
kilfoyle authored Oct 28, 2024
1 parent 41cae15 commit a50d41d
Showing 9 changed files with 29 additions and 15 deletions.
Binary file added .DS_Store
Binary file not shown.
2 changes: 1 addition & 1 deletion docs/help.asciidoc
@@ -1,4 +1,4 @@
If you are an existing Elastic customer with an active support contract, you can create a case in the link:https://support.elastic.co/[Elastic Support Portal]. Kindly attach an <<{p}-take-eck-dump,ECK diagnostic>> when opening your case.
If you are an existing Elastic customer with an active support contract, you can create a case in the link:https://support.elastic.co/[Elastic Support Portal]. Kindly attach an <<{p}-run-eck-diagnostics,ECK diagnostic>> when opening your case.

Alternatively, or if you do not have a support contract, and if you are unable to find a solution to your problem with the information provided in these documents, ask for help:

2 changes: 1 addition & 1 deletion docs/index.asciidoc
@@ -17,7 +17,7 @@ include::quickstart.asciidoc[]
include::operating-eck/operating-eck.asciidoc[]
include::orchestrating-elastic-stack-applications/orchestrating-elastic-stack-applications.asciidoc[]
include::advanced-topics/advanced-topics.asciidoc[]
include::troubleshooting.asciidoc[]
include::reference/reference.asciidoc[]

include::release-notes/highlights.asciidoc[]
include::release-notes.asciidoc[]
2 changes: 1 addition & 1 deletion docs/operating-eck/air-gapped.asciidoc
@@ -73,6 +73,6 @@ For example, if your private registry is `my.registry` and all Elastic images ar
[id="{p}-eck-diag-air-gapped"]
== ECK Diagnostics in air-gapped environments

The <<{p}-take-eck-dump,eck-diagnostics tool>> optionally runs diagnostics for Elastic Stack applications in a separate container that is deployed into the Kubernetes cluster.
The <<{p}-run-eck-diagnostics,eck-diagnostics tool>> optionally runs diagnostics for Elastic Stack applications in a separate container that is deployed into the Kubernetes cluster.

In air-gapped environments with no access to the `docker.elastic.co` registry, you should copy the latest support-diagnostics container image to your internal image registry and then run the tool with the additional flag `--diagnostic-image <custom-support-diagnostics-image-name>`. To find out which support-diagnostics container image matches your version of eck-diagnostics, run the tool once without arguments; it prints the default image in use.
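
For illustration, a minimal invocation sketch; the registry path and tag below are placeholders, not values from this commit:

[source,sh]
----
# Point eck-diagnostics at a copy of the support-diagnostics image
# hosted in a private registry. Registry path and tag are placeholders.
eck-diagnostics --diagnostic-image my.registry/eck-support-diagnostics:<tag>
----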
2 changes: 0 additions & 2 deletions docs/operating-eck/operating-eck.asciidoc
@@ -15,7 +15,6 @@ endif::[]
- <<{p}-configure-operator-metrics>>
- <<{p}-restrict-cross-namespace-associations>>
- <<{p}-licensing>>
- <<{p}-troubleshooting>>
- <<{p}-installing-eck>>
- <<{p}-upgrading-eck>>
- <<{p}-uninstalling-eck>>
@@ -28,7 +27,6 @@ include::webhook.asciidoc[leveloffset=+1]
include::configure-operator-metrics.asciidoc[leveloffset=+1]
include::restrict-cross-namespace-associations.asciidoc[leveloffset=+1]
include::licensing.asciidoc[leveloffset=+1]
include::troubleshooting.asciidoc[leveloffset=+1]
include::installing-eck.asciidoc[leveloffset=+1]
include::upgrading-eck.asciidoc[leveloffset=+1]
include::uninstalling-eck.asciidoc[leveloffset=+1]
docs/troubleshooting.asciidoc
@@ -5,13 +5,14 @@ link:https://www.elastic.co/guide/en/cloud-on-k8s/master/k8s-{page_id}.html[View
****
endif::[]
[id="{p}-{page_id}"]
= Troubleshoot ECK
= Troubleshooting ECK

- <<{p}-common-problems>>
- <<{p}-troubleshooting-methods>>
- <<{p}-run-eck-diagnostics>>
include::../help.asciidoc[]
include::./help.asciidoc[]

include::troubleshooting/common-problems.asciidoc[leveloffset=+1]
include::troubleshooting/troubleshooting-methods.asciidoc[leveloffset=+1]
include::troubleshooting/take-eck-dump.asciidoc[leveloffset=+1]
include::troubleshooting/run-eck-diagnostics.asciidoc[leveloffset=+1]
docs/troubleshooting/common-problems.asciidoc
@@ -7,6 +7,7 @@ endif::[]
[id="{p}-{page_id}"]
= Common problems

[float]
[id="{p}-{page_id}-operator-oom"]
== Operator crashes on startup with `OOMKilled`

@@ -59,6 +60,7 @@ kubectl patch sts elastic-operator -n elastic-system -p '{"spec":{"template":{"s

NOTE: Set limits (`spec.containers[].resources.limits`) that match requests (`spec.containers[].resources.requests`) to prevent the operator's Pod from being terminated during link:https://kubernetes.io/docs/concepts/scheduling-eviction/node-pressure-eviction/[node-pressure eviction].
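
As a sketch of what that note implies, assuming the default container name and illustrative memory values; adjust both to your deployment:

[source,sh]
----
# Set matching memory requests and limits on the operator StatefulSet.
# "manager" and "1Gi" are illustrative; use your container name and sizing.
kubectl patch sts elastic-operator -n elastic-system -p \
  '{"spec":{"template":{"spec":{"containers":[{"name":"manager","resources":{"requests":{"memory":"1Gi"},"limits":{"memory":"1Gi"}}}]}}}}'
----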

[float]
[id="{p}-{page_id}-webhook-timeout"]
== Timeout when submitting a resource manifest

@@ -71,6 +73,7 @@ Error from server (Timeout): error when creating "elasticsearch.yaml": Timeout:

This error is usually an indication of a problem communicating with the validating webhook. If you are running ECK on a private Google Kubernetes Engine (GKE) cluster, you may need to link:https://cloud.google.com/kubernetes-engine/docs/how-to/private-clusters#add_firewall_rules[add a firewall rule] allowing port 9443 from the API server. Another possible cause for failure is if a strict network policy is in effect. Refer to the <<{p}-webhook-troubleshooting-timeouts,webhook troubleshooting>> documentation for more details and workarounds.

[float]
[id="{p}-{page_id}-owner-refs"]
== Copying secrets with Owner References

@@ -128,6 +131,7 @@ type: Opaque

Failure to do so can cause data loss.

[float]
[id="{p}-{page_id}-scale-down"]
== Scale down of Elasticsearch master-eligible Pods seems stuck

@@ -160,6 +164,7 @@ Then, scale down the StatefulSet to the right size `m`, removing the pending Pod

CAUTION: Do not use this method to scale down Pods that have already joined the Elasticsearch cluster, as additional data loss protection that ECK applies is sidestepped.
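
A hedged sketch of that scale-down step; the StatefulSet name and target size are placeholders, not values from this page:

[source,sh]
----
# Scale the master StatefulSet down to the intended size m (here m=3).
# The StatefulSet name is illustrative; find yours with "kubectl get sts".
kubectl scale sts quickstart-es-master --replicas=3
----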

[float]
[id="{p}-{page_id}-pod-updates"]
== Pods are not replaced after a configuration update

@@ -235,6 +240,7 @@ In this case, you have to add more K8s nodes, or free up resources.

For more information, check <<{p}-troubleshooting-methods>>.

[float]
[id="{p}-{page_id}-olm-upgrade"]
== ECK operator upgrade stays pending when using OLM

@@ -254,13 +260,15 @@ If you are using one of the affected versions of OLM and upgrading OLM to a newe
can still be upgraded by uninstalling and reinstalling it. This can be done by removing the `Subscription` and both `ClusterServiceVersion` resources and adding them again.
On OpenShift the same workaround can be performed in the UI by clicking on "Uninstall Operator" and then reinstalling it through OperatorHub.
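
A minimal sketch of that workaround, assuming OLM's CRDs are installed; every name and the namespace below are placeholders:

[source,sh]
----
# Remove the Subscription and both ClusterServiceVersion resources,
# then re-add them. All names and the namespace are placeholders.
kubectl delete subscription elastic-cloud-eck -n operators
kubectl delete csv elastic-cloud-eck.v2.13.0 elastic-cloud-eck.v2.14.0 -n operators
# Recreate the Subscription afterwards so OLM reinstalls the operator.
----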

[float]
[id="{p}-{page_id}-version-downgrade"]
== If you upgraded Elasticsearch to the wrong version
If you accidentally upgrade one of your Elasticsearch clusters to a version that does not exist, or to a version to which a direct upgrade is not possible from your currently deployed version, a validation prevents you from going back to the previous version.
The reason for this validation is that ECK does not allow downgrades, as they are not supported by Elasticsearch; once the Elasticsearch data directory has been upgraded, there is no way back to the old version without a link:https://www.elastic.co/guide/en/elasticsearch/reference/current/setup-upgrade.html[snapshot restore].

These two upgrading scenarios, however, are exceptions because Elasticsearch never started up successfully. If you annotate the Elasticsearch resource with `eck.k8s.elastic.co/disable-downgrade-validation=true` ECK allows you to go back to the old version at your own risk. If you also attempted an upgrade of other related Elastic Stack applications at the same time you can use the same annotation to go back. Remove the annotation afterwards to prevent accidental downgrades and reduced availability.
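
For illustration, using the annotation named above; the cluster name `quickstart` is a placeholder:

[source,sh]
----
# Allow the downgrade at your own risk...
kubectl annotate elasticsearch quickstart eck.k8s.elastic.co/disable-downgrade-validation=true
# ...and remove the annotation once the cluster is back on the old version.
kubectl annotate elasticsearch quickstart eck.k8s.elastic.co/disable-downgrade-validation-
----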

[float]
[id="{p}-{page_id}-815-reconfigure-role-mappings"]
== Reconfigure stack config policy based role mappings after an upgrade to 8.15.3 from 8.14.x or 8.15.x

docs/troubleshooting/run-eck-diagnostics.asciidoc
@@ -1,4 +1,4 @@
:page_id: take-eck-dump
:page_id: run-eck-diagnostics
ifdef::env-github[]
****
link:https://www.elastic.co/guide/en/cloud-on-k8s/master/k8s-{page_id}.html[View this document on the Elastic website]
docs/troubleshooting/troubleshooting-methods.asciidoc
@@ -17,14 +17,15 @@ Most common issues can be identified and resolved by following these instruction
- <<{p}-exclude-resource,Exclude a resource from reconciliation>>
- <<{p}-get-k8s-events,Get Kubernetes events>>
- <<{p}-exec-into-containers,Exec into containers>>
- <<{p}-resize-pv>>
- <<{p}-suspend-elasticsearch>>
- <<{p}-capture-jvm-heap-dumps>>
If you are still unable to find a solution to your problem, ask for help:

include::../../help.asciidoc[]

include::./../help.asciidoc[]

[float]
[id="{p}-get-resources"]
== View the list of resources

@@ -55,6 +56,7 @@ elasticsearch-sample-es-http ClusterIP 10.19.248.93 <none> 9200/TC
kibana-sample-kb-http ClusterIP 10.19.246.116 <none> 5601/TCP 3d
----
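
A hedged sketch of how such a listing is produced; `elastic` is the resource category ECK registers for its CRDs, but verify it against your installation:

[source,sh]
----
# List all ECK-managed resources (Elasticsearch, Kibana, APM Server, ...)
kubectl get elastic
----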

[float]
[id="{p}-describe-failing-resources"]
== Describe failing resources

@@ -101,6 +103,7 @@ Events:

If you get an error with unbound persistent volume claims (PVCs), it means there is currently no persistent volume that can satisfy the claim. If you are using automatically provisioned storage (for example the Amazon EBS provisioner), the storage provider can sometimes take a few minutes to provision a volume, so the issue may resolve itself. You can also check the status by running `kubectl describe persistentvolumeclaims` to monitor events of the PVCs.
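
For example (the claim name is a placeholder following ECK's usual `elasticsearch-data-<pod-name>` naming, not a value from this page):

[source,sh]
----
# Inspect a single claim; events at the bottom show provisioning progress.
kubectl describe pvc elasticsearch-data-quickstart-es-default-0
----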

[float]
[id="{p}-eck-debug-logs"]
== Enable ECK debug logs

@@ -124,6 +127,7 @@ change the `args` array as follows:

Once your change is saved, the operator is automatically restarted by the StatefulSet controller to apply the new settings.
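
A sketch of the kind of change meant here, assuming the operator accepts a `--log-verbosity` flag; verify the flag against your ECK version:

[source,yaml]
----
# Excerpt of the operator StatefulSet Pod template; only args shown.
args:
- "manager"
- "--log-verbosity=1"   # assumed: 0 = info (default), 1 = debug
----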

[float]
[id="{p}-view-logs"]
== View logs

@@ -172,7 +176,7 @@ Logs with `ERROR` level indicate something is not going as expected.
Due to link:https://github.com/eBay/Kubernetes/blob/master/docs/devel/api-conventions.md#concurrency-control-and-consistency[optimistic locking],
you can get errors reporting a conflict while updating a resource. You can ignore them, as the update goes through at the next reconciliation attempt, which will happen almost immediately.
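
A hedged example of tailing those operator logs; the Pod name and namespace follow the default installation and may differ in yours:

[source,sh]
----
# Follow the ECK operator logs from the default install namespace.
kubectl logs -f elastic-operator-0 -n elastic-system
----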


[float]
[id="{p}-resource-level-config"]
== Configure Elasticsearch timeouts

@@ -188,7 +192,7 @@ To set the Elasticsearch client timeout to 60 seconds for a cluster named `quick
kubectl annotate elasticsearch quickstart eck.k8s.elastic.co/es-client-timeout=60s
----


[float]
[id="{p}-exclude-resource"]
== Exclude resources from reconciliation

@@ -212,6 +216,7 @@ Or in one line:
kubectl annotate elasticsearch quickstart --overwrite eck.k8s.elastic.co/managed=false
----

[float]
[id="{p}-get-k8s-events"]
== Get Kubernetes events

@@ -246,7 +251,7 @@ LAST SEEN FIRST SEEN COUNT NAME KIND
You can set filters for Kibana and APM Server too.
Note that the default TTL for events in Kubernetes is 1h, so unless your cluster settings have been modified you will not get events older than 1h.
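
A hedged filter sketch; the field selector syntax is standard Kubernetes, while the kind below is just one example of what you might filter on:

[source,sh]
----
# Show only events emitted for Elasticsearch resources.
kubectl get events --field-selector involvedObject.kind=Elasticsearch
----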


[float]
[id="{p}-resize-pv"]
== Resizing persistent volumes

@@ -307,6 +312,7 @@ spec:

and ECK will automatically create a new StatefulSet and begin migrating data into it.
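
A sketch of the rename being described, assuming a nodeSet originally called `default`; names and count are illustrative:

[source,yaml]
----
spec:
  nodeSets:
  - name: default-v2   # renamed from "default" to force a new StatefulSet
    count: 3
----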

[float]
[id="{p}-exec-into-containers"]
== Exec into containers

@@ -319,6 +325,7 @@ kubectl exec -ti elasticsearch-sample-es-p45nrjch29 bash

This can also be done for Kibana and APM Server.

[float]
[id="{p}-suspend-elasticsearch"]
== Suspend Elasticsearch

@@ -348,7 +355,7 @@ Once you are done with troubleshooting the node, you can resume normal operation
kubectl annotate es quickstart eck.k8s.elastic.co/suspend-
----
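
The suspend counterpart, for context; this assumes the annotation takes a comma-separated list of Pod names, and the Pod name here is a placeholder:

[source,sh]
----
# Suspend a single Elasticsearch node for troubleshooting.
kubectl annotate es quickstart eck.k8s.elastic.co/suspend=quickstart-es-default-0
----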


[float]
[id="{p}-capture-jvm-heap-dumps"]
== Capture JVM heap dumps

