(OSD-22433) Update SOP to include Stop/Start of ec2 Instance #181

Closed. The pull request proposed merging 3 commits.
23 changes: 15 additions & 8 deletions alerts/cluster-etcd-operator/etcdMembersDown.md
@@ -22,13 +22,13 @@ Login to the cluster. Check health of master nodes if any of them is in
`NotReady` state or not.

```console
-$ oc get nodes -l node-role.kubernetes.io/master=
+oc get nodes -l node-role.kubernetes.io/master=
```
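
Reviewer sketch (not part of the diff): the node check above could be narrowed to only the unhealthy masters. This assumes `oc` is already logged in to the cluster; `not_ready_masters` is a hypothetical helper name, and the filter treats any status that does not start with `Ready` as unhealthy.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the names of master nodes whose STATUS
# column does not start with "Ready" (for example "NotReady").
# "Ready,SchedulingDisabled" is treated as Ready here.
not_ready_masters() {
  oc get nodes -l node-role.kubernetes.io/master= --no-headers \
    | awk '$2 !~ /^Ready/ { print $1 }'
}
```

Running `not_ready_masters` with no arguments prints one node name per line, or nothing when all masters report `Ready`.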

Check if an upgrade is in progress.

```console
-$ oc adm upgrade
+oc adm upgrade
```

In case there is no upgrade going on, but there is a change in the
@@ -39,7 +39,7 @@ the master nodes. This is the case when the [machine-config-operator
(MCO)](https://github.com/openshift/machine-config-operator) is working on it.

```console
-$ oc get nodes -l node-role.kubernetes.io/master= -o template --template='{{range .items}}{{"===> node:> "}}{{.metadata.name}}{{"\n"}}{{range $k, $v := .metadata.annotations}}{{println $k ":" $v}}{{end}}{{"\n"}}{{end}}'
+oc get nodes -l node-role.kubernetes.io/master= -o template --template='{{range .items}}{{"===> node:> "}}{{.metadata.name}}{{"\n"}}{{range $k, $v := .metadata.annotations}}{{println $k ":" $v}}{{end}}{{"\n"}}{{end}}'
```
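
Reviewer sketch (not part of the diff): the full annotation dump above can be long, so a narrower view may help. This assumes the MCO annotations `machineconfiguration.openshift.io/currentConfig`, `desiredConfig`, and `state` are set on the master nodes and that `Done` is the settled state; `mco_pending_masters` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list master nodes that the MCO still appears to be
# working on, i.e. current and desired config differ or state is not "Done".
mco_pending_masters() {
  oc get nodes -l node-role.kubernetes.io/master= \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.annotations.machineconfiguration\.openshift\.io/currentConfig}{"\t"}{.metadata.annotations.machineconfiguration\.openshift\.io/desiredConfig}{"\t"}{.metadata.annotations.machineconfiguration\.openshift\.io/state}{"\n"}{end}' \
    | awk -F'\t' 'NF && ($2 != $3 || $4 != "Done") { print $1 ": state=" $4 }'
}
```

An empty result suggests the MCO is not currently changing any master node, in which case the cloud provider checks under Mitigation are the next step.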

### General etcd health
@@ -48,19 +48,19 @@ To run `etcdctl` commands, we need to `rsh` into the `etcdctl` container of any
etcd pod.

```console
-$ oc rsh -c etcdctl -n openshift-etcd $(oc get pod -l app=etcd -oname -n openshift-etcd | awk -F"/" 'NR==1{ print $2 }')
+oc rsh -c etcdctl -n openshift-etcd $(oc get pod -l app=etcd -oname -n openshift-etcd | awk -F"/" 'NR==1{ print $2 }')
```

Validate that the `etcdctl` command is available:

```console
-$ etcdctl version
+etcdctl version
```

Run the following command to get the health of etcd:

```console
-$ etcdctl endpoint health -w table
+etcdctl endpoint health -w table
```

## Mitigation
@@ -69,6 +69,13 @@ If an upgrade is in progress, the alert may automatically resolve in some time
when the master node comes up again. If MCO is not working on the master node,
check the cloud provider to verify if the master node instances are running or not.

-In the case when you are running on AWS, the AWS instance retirement might need
-a manual reboot of the master node.
+### Restarting Instance in AWS

+If the master node is unhealthy, you can try stopping and starting the instance
+in AWS. Log in to the AWS account of the cluster and find the instance of
+the affected master node by searching the running EC2 instances for the node
+name. Select the instance, then at the top right choose "Instance state" and
+"Stop instance". After the instance stops, repeat the process and
+choose "Start instance".

+![Stop/Start Instance Button](img/ec2-stop-start.png)
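
Reviewer sketch (not part of the diff): the console steps in the new section could also be done with the AWS CLI. This is a minimal sketch, assuming the AWS CLI is configured with credentials for the cluster's account and region, and that the node name matches the instance's private DNS name. Both function names are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: look up the EC2 instance ID for a node name.
# Assumption: the node name equals the instance's private DNS name.
instance_id_for_node() {
  aws ec2 describe-instances \
    --filters "Name=private-dns-name,Values=$1" \
    --query 'Reservations[].Instances[].InstanceId' \
    --output text
}

# Hypothetical helper: stop the instance, wait until it is stopped,
# then start it and wait until it is running. Stops at the first failure.
stop_start_instance() {
  local id="$1"
  aws ec2 stop-instances --instance-ids "$id" >/dev/null &&
    aws ec2 wait instance-stopped --instance-ids "$id" &&
    aws ec2 start-instances --instance-ids "$id" >/dev/null &&
    aws ec2 wait instance-running --instance-ids "$id"
}
```

A possible invocation is `stop_start_instance "$(instance_id_for_node <node-name>)"`, where `<node-name>` is the affected master as shown by `oc get nodes`. A stop followed by a start, rather than a reboot, is what the section describes, so the sketch keeps the two steps separate.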