Update etcd defrag and backup
This commit updates the etcd defrag and backup script as follows:
- The script exits without any defrag/backup/trim action if:
  - It is executed on a non-leader etcd member
  - It is executed on a single-member etcd cluster
- The script defragments the etcd cluster as follows:
  - Defrag the non-leader etcd members first
  - Move leadership to a randomly selected etcd member whose defragmentation has completed
  - Defrag the local (ex-leader) etcd member
- The script then backs up & trims the local (ex-leader) etcd member
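
The member selection described above can be sketched as a small bash helper that filters `etcdctl endpoint status --cluster` lines down to the non-leader IDs. This is only an illustration: the function name `non_leader_ids` and the sample input are assumptions and are not part of this commit. The field positions (ID in column 2, leader flag in column 5) match the ones the script in this commit relies on.

```shell
#!/bin/bash
# Hypothetical helper: print the IDs of all non-leader members, one per line.
# Reads comma-separated "endpoint status" lines from stdin.
non_leader_ids() {
  local endpoint id version size is_leader rest
  while IFS=',' read -r endpoint id version size is_leader rest; do
    # Strip the blanks that follow each comma in the simple output format
    id="${id//[[:blank:]]/}"
    is_leader="${is_leader//[[:blank:]]/}"
    if [ "$is_leader" = "false" ]; then
      echo "$id"
    fi
  done
}

# Made-up sample input standing in for real cluster output
SAMPLE_STATUS='https://10.0.0.1:2379, aaaa000000000001, 3.5.7, 25 kB, true, false, 2, 8, 8,
https://10.0.0.2:2379, bbbb000000000002, 3.5.7, 25 kB, false, false, 2, 8, 8,
https://10.0.0.3:2379, cccc000000000003, 3.5.7, 25 kB, false, false, 2, 8, 8,'

NON_LEADERS=$(printf '%s\n' "$SAMPLE_STATUS" | non_leader_ids)
echo "$NON_LEADERS"
```

With the sample input, only the two members whose leader flag is `false` are printed. Any of them is a valid target for the leadership move.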

This script executes etcdctl commands such as `etcdctl move-leader` and
`etcdctl endpoint status --cluster`, which were introduced in etcdctl
version 3.3.0. The previous etcdctl client was installed as an `apt`
package, but the latest etcdctl version available in the Ubuntu 20.04
repositories is v3.2.26. This commit therefore also introduces the
`etcdctl_version` variable, which holds the desired version of the
etcdctl client. That client is then used for etcd DB maintenance tasks.
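
The version constraint above can be checked with a few lines of bash. This is a minimal sketch, not something this commit ships: the helper name `version_ge` is hypothetical, and it assumes GNU `sort` with the `-V` (version sort) option is available.

```shell
#!/bin/bash
# Hypothetical helper: succeed if version $1 is greater than or equal to version $2.
# Relies on GNU sort's -V (version sort) option and tolerates a leading "v".
version_ge() {
  local have="${1#v}" want="${2#v}"
  # The smaller of the two versions sorts first; $1 is new enough if that is $2
  [ "$(printf '%s\n%s\n' "$want" "$have" | sort -V | head -n1)" = "$want" ]
}

REQUIRED="3.3.0"   # per the commit message: first version with the needed commands
for candidate in v3.2.26 v3.5.7; do
  if version_ge "$candidate" "$REQUIRED"; then
    echo "$candidate: ok"
  else
    echo "$candidate: too old"
  fi
done
```

The loop reports v3.2.26 (the Ubuntu 20.04 package version) as too old and v3.5.7 (the new default) as ok.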

Issue #384

Signed-off-by: Matej Feder <matej.feder@dnation.cloud>
matofeder committed Mar 17, 2023
1 parent 6da1930 commit 73f9c1d
Showing 9 changed files with 89 additions and 18 deletions.
1 change: 1 addition & 0 deletions OLD_README.md
@@ -481,6 +481,7 @@ Parameters controlling the cluster creation:
| `etcd_unsafe_fs` | `ETCD_UNSAFE_FS` | SCS | `false` | Use `barrier=0` for filesystem on control nodes to avoid storage latency. Use for multi-controller clusters on slow/networked storage, otherwise not recommended. |
| `testcluster_name` | (cmd line) | SCS | `testcluster` | Allows setting the default cluster name, created at bootstrap (if `controller_count` is larger than 0) |
| `restrict_kubeapi` | `RESTRICT_KUBEAPI` | SCS | `[ ]` | Allows restricting access to kubernetes API by list of CIDRs. Empty list (default) means public, `[ "none" ]` means internal access only. |
| `etcdctl_version` | `ETCDCTL_VERSION` | SCS | `v3.5.7` | Version of the etcdctl client that is used for etcd DB maintenance tasks |

Optional services deployed to cluster:

1 change: 1 addition & 0 deletions doc/configuration.md
@@ -60,6 +60,7 @@ Parameters controlling the cluster creation:
| `etcd_unsafe_fs` | `ETCD_UNSAFE_FS` | SCS | `false` | Use `barrier=0` for filesystem on control nodes to avoid storage latency. Use for multi-controller clusters on slow/networked storage, otherwise not recommended. |
| `testcluster_name` | (cmd line) | SCS | `testcluster` | Allows setting the default cluster name, created at bootstrap (if `controller_count` is larger than 0) |
| `restrict_kubeapi` | `RESTRICT_KUBEAPI` | SCS | `[ ]` | Allows restricting access to kubernetes API by list of CIDRs. Empty list (default) means public, `[ "none" ]` means internal access only. |
| `etcdctl_version` | `ETCDCTL_VERSION` | SCS | `v3.5.7` | Version of the etcdctl client that is used for etcd DB maintenance tasks |

Optional services deployed to cluster:

1 change: 1 addition & 0 deletions terraform/environments/environment-default.tfvars
@@ -15,6 +15,7 @@ cilium_binaries = "<v0.aa.bb;v0.xx.yy>" # defaults to "v0.13.1;v0.11.2"
kubernetes_version = "<v1.XX.XX>" # defaults to "v1.25.x"
kube_image_raw = "<boolean>" # defaults to "true"
calico_version = "<v3.xx.y>" # defaults to "v3.25.0"
etcdctl_version = "<v3.x.yy>" # defaults to "v3.5.7"
controller_flavor = "<flavor>" # defaults to SCS-2V-4-20s (use etcd tweaks if you only have SCS-2V-4-20 in multi-controller setups)
worker_flavor = "<flavor>" # defaults to SCS-2V-4-20 (larger helps)
controller_count = <number> # defaults to 1 (0 skips testcluster creation)
6 changes: 5 additions & 1 deletion terraform/files/bin/create_cluster.sh
@@ -83,7 +83,7 @@ fi
fixup_k8sregistry.sh "$CCCFG" "${CLUSTERAPI_TEMPLATE}"

cp -p "$CCCFG" $HOME/.cluster-api/clusterctl.yaml
KCCCFG="--config $CCCFG"
KCCCFG="--config $CCCFG"
#clusterctl $KCCCFG config cluster ${CLUSTER_NAME} --list-variables --from ${CLUSTERAPI_TEMPLATE}
clusterctl $KCCCFG generate cluster "${CLUSTER_NAME}" --list-variables --from ${CLUSTERAPI_TEMPLATE} || exit 2

@@ -168,6 +168,10 @@ else
kubectl $KCONTEXT apply -f ~/$CLUSTER_NAME/deployed-manifests.d/calico.yaml
fi

# Etcd defrag & backup script is expected as a secret in cluster-template.yaml
kubectl create secret generic etcd-defrag --from-file=data=etcd-defrag.sh --dry-run=client -oyaml > ~/$CLUSTER_NAME/deployed-manifests.d/etcd-defrag.yaml || exit 3
kubectl $KCONTEXT apply -f ~/$CLUSTER_NAME/deployed-manifests.d/etcd-defrag.yaml

# OpenStack, Cinder
apply_openstack_integration.sh "$CLUSTER_NAME" || exit $?
apply_cindercsi.sh "$CLUSTER_NAME" || exit $?
63 changes: 63 additions & 0 deletions terraform/files/bin/etcd-defrag.sh
@@ -0,0 +1,63 @@
#!/bin/bash
# Defragment & backup & trim script for SCS k8s-cluster-api-provider etcd cluster
# The script exits without any defrag/backup/trim action if:
# - It is executed on a non-leader etcd member
# - It is executed on a single-member etcd cluster
# The script defragments the etcd cluster as follows:
# - Defrag the non-leader etcd members first
# - Move leadership to a randomly selected etcd member whose defragmentation has completed
# - Defrag the local (ex-leader) etcd member
# The script then backs up & trims the local (ex-leader) etcd member
#
# Usage: etcd-defrag.sh

export LOG_DIR=/var/log
export ETCDCTL_API=3
ETCDCTL="etcdctl --cert /etc/kubernetes/pki/etcd/peer.crt --key /etc/kubernetes/pki/etcd/peer.key --cacert /etc/kubernetes/pki/etcd/ca.crt"

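# Field 5 of `endpoint status` is the leader flag; "false" means the local member is not the leader
if test "$($ETCDCTL endpoint status | cut -d ',' -f 5 | tr -d '[:blank:]')" = "false"; then
  echo "Exit on non leader"
  exit 0
fi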

# Get all etcd members with their endpoints, IDs, and leader status
declare -a MEMBERS
while read -r MEMBER; do
  MEMBERS+=( "$MEMBER" )
done < <($ETCDCTL endpoint status --cluster)

if test ${#MEMBERS[@]} = 1; then
  echo "Exit on single member etcd"
  exit 0
fi

declare -a NON_LEADER_IDS
for MEMBER in "${MEMBERS[@]}"; do
  # Get member ID, endpoint, and leader status
  MEMBER_ENDPOINT=$(echo "$MEMBER" | cut -d ',' -f 1 | tr -d '[:blank:]')
  MEMBER_ID=$(echo "$MEMBER" | cut -d ',' -f 2 | tr -d '[:blank:]')
  MEMBER_IS_LEADER=$(echo "$MEMBER" | cut -d ',' -f 5 | tr -d '[:blank:]')
  # Defragment if $MEMBER is not the leader
  if test "$MEMBER_IS_LEADER" = "false"; then
    echo "Etcd member ${MEMBER_ENDPOINT} is not the leader, let's defrag it!"
    $ETCDCTL --endpoints="$MEMBER_ENDPOINT" defrag
    NON_LEADER_IDS+=( "$MEMBER_ID" )
  fi
done

# Randomly pick an ID from non-leader IDs and make it a leader
RANDOM_NON_LEADER_ID=${NON_LEADER_IDS[ $((RANDOM % ${#NON_LEADER_IDS[@]})) ]}
echo "Member ${RANDOM_NON_LEADER_ID} is becoming the leader"
$ETCDCTL move-leader "$RANDOM_NON_LEADER_ID"

# Defrag this ex-leader etcd member
sync
sleep 2
$ETCDCTL defrag

# Backup&trim this ex-leader etcd member
sleep 3
$ETCDCTL snapshot save /root/etcd-backup
chmod 0600 /root/etcd-backup
xz -f /root/etcd-backup
fstrim -v /var/lib/etcd
27 changes: 10 additions & 17 deletions terraform/files/template/cluster-template.yaml
@@ -81,22 +81,10 @@ spec:
- path: /root/etcd-defrag.sh
owner: "root:root"
permissions: "0755"
content: |
#!/bin/bash
export LOG_DIR=/var/log
export ETCDCTL_API=3
if test "$(etcdctl --cert /etc/kubernetes/pki/etcd/peer.crt --key /etc/kubernetes/pki/etcd/peer.key --cacert /etc/kubernetes/pki/etcd/ca.crt endpoint status | cut -d ',' -f 5)" != " false"; then
echo "Exit on leader"
exit 0
fi
sync
sleep 2
etcdctl --cert /etc/kubernetes/pki/etcd/peer.crt --key /etc/kubernetes/pki/etcd/peer.key --cacert /etc/kubernetes/pki/etcd/ca.crt defrag
sleep 3
etcdctl --cert /etc/kubernetes/pki/etcd/peer.crt --key /etc/kubernetes/pki/etcd/peer.key --cacert /etc/kubernetes/pki/etcd/ca.crt snapshot save /root/etcd-backup
chmod 0600 /root/etcd-backup
xz -f /root/etcd-backup
fstrim -v /var/lib/etcd
contentFrom:
secret:
name: etcd-defrag
key: data
- path: /etc/systemd/system/etcd-defrag.service
owner: "root:root"
permissions: "0644"
@@ -144,8 +132,13 @@ spec:
- apt-get update -y
- TRIMMED_KUBERNETES_VERSION=$(echo ${KUBERNETES_VERSION} | sed 's/\./\./g' | sed 's/^v//')
- RESOLVED_KUBERNETES_VERSION=$(apt-cache policy kubelet | sed 's/\*\*\*//' | awk -v VERSION=$${TRIMMED_KUBERNETES_VERSION} '$1~ VERSION { print $1 }' | head -n1)
- apt-get install -y ca-certificates socat jq ebtables apt-transport-https cloud-utils prips containerd etcd-client kubelet=$${RESOLVED_KUBERNETES_VERSION} kubeadm=$${RESOLVED_KUBERNETES_VERSION} kubectl=$${RESOLVED_KUBERNETES_VERSION}
- apt-get install -y ca-certificates socat jq ebtables apt-transport-https cloud-utils prips containerd kubelet=$${RESOLVED_KUBERNETES_VERSION} kubeadm=$${RESOLVED_KUBERNETES_VERSION} kubectl=$${RESOLVED_KUBERNETES_VERSION}
- systemctl daemon-reload
# Install etcdctl
- curl -L https://github.com/coreos/etcd/releases/download/${ETCDCTL_VERSION}/etcd-${ETCDCTL_VERSION}-linux-amd64.tar.gz -o etcd-${ETCDCTL_VERSION}-linux-amd64.tar.gz
- tar xzvf etcd-${ETCDCTL_VERSION}-linux-amd64.tar.gz
- sudo cp etcd-${ETCDCTL_VERSION}-linux-amd64/etcdctl /usr/local/bin/
- rm -rf etcd-${ETCDCTL_VERSION}-linux-amd64 etcd-${ETCDCTL_VERSION}-linux-amd64.tar.gz
# TODO: Detect local SSD and mkfs/mount /var/lib/etcd
version: "${KUBERNETES_VERSION}"
---
1 change: 1 addition & 0 deletions terraform/files/template/clusterctl.yaml.tmpl
@@ -49,6 +49,7 @@ WORKER_MACHINE_GEN: genw01
# Openstack Availability Zone
OPENSTACK_FAILURE_DOMAIN: ${availability_zone}

ETCDCTL_VERSION: ${etcdctl_version}
ETCD_UNSAFE_FS: ${etcd_unsafe_fs}

# Nodes CIDR
1 change: 1 addition & 0 deletions terraform/mgmtcluster.tf
@@ -142,6 +142,7 @@ EOF
deploy_nginx_ingress = var.deploy_nginx_ingress,
deploy_occm = var.deploy_occm,
dns_nameservers = var.dns_nameservers,
etcdctl_version = var.etcdctl_version,
etcd_unsafe_fs = var.etcd_unsafe_fs,
external = var.external,
image_registration_extra_flags = var.image_registration_extra_flags,
6 changes: 6 additions & 0 deletions terraform/variables.tf
@@ -169,6 +169,12 @@ variable "cilium_binaries" {
default = "v0.13.1;v0.11.2"
}

variable "etcdctl_version" {
description = "desired version of etcdctl client that is used for etcd DB maintenance tasks"
type = string
default = "v3.5.7"
}

variable "etcd_unsafe_fs" {
description = "mount controller root fs with nobarrier"
type = bool