We use a similar approach to Aerospike: watch for GCP maintenance (live migration) events on the TiDB/TiKV/PD nodes and take the appropriate action for each role (a sketch of the watch loop follows the list):
- TiDB: Take the TiDB instance offline by cordoning its node and deleting the TiDB pod (the node pool hosting TiDB instances MUST have autoscaling enabled; the cordoned node is expected to be reclaimed by the autoscaler).
- TiKV: Evict the Region leaders from the TiKV store during maintenance.
- PD: Resign leadership if the current PD instance is the PD leader.
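The watcher is conceptually a blocking poll on the GCE metadata server's maintenance-event key. The sketch below is not the actual /main.py, only an illustration of the mechanism; the handle_maintenance dispatch is a placeholder:

```python
# Minimal sketch of the maintenance watcher (illustrative, not the real /main.py).
import os
import requests

# GCE metadata key that flips to MIGRATE_ON_HOST_MAINTENANCE when a live migration starts.
METADATA_URL = ("http://metadata.google.internal/computeMetadata/v1/"
                "instance/maintenance-event")
HEADERS = {"Metadata-Flavor": "Google"}


def handle_maintenance(role):
    # Placeholder: dispatch to the per-role action described above
    # (tidb: cordon node + delete pod, tikv: evict leaders, pd: resign leader).
    print(f"maintenance event received, role={role}")


def watch():
    last_etag = "NONE"
    while True:
        try:
            # wait_for_change=true makes the request hang until the value changes.
            resp = requests.get(
                METADATA_URL,
                params={"wait_for_change": "true", "last_etag": last_etag},
                headers=HEADERS,
                timeout=120,
            )
        except requests.exceptions.Timeout:
            continue  # no change within the client timeout; keep waiting
        resp.raise_for_status()
        last_etag = resp.headers.get("etag", last_etag)
        if resp.text.strip() == "MIGRATE_ON_HOST_MAINTENANCE":
            handle_maintenance(os.environ.get("ROLE", ""))


if __name__ == "__main__":
    watch()
```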
An additional sidecar container is added to each TiDB/TiKV/PD pod to run the maintenance-watching script.
Use the public sidecar image pingcap/tidb-gcp-live-migration:${TIDB_VERSION} (e.g. pingcap/tidb-gcp-live-migration:v7.1.0), or build the image yourself:

```bash
TIDB_VERSION=v7.1.0 IMAGE=${YOUR_IMAGE}/tidb-gcp-live-migration make image-release
```
If TLS is enabled between the cluster components, configure the sidecars as follows.

For TiDB, first create the RBAC resources the sidecar needs:

```bash
# replace ${SERVICEACCOUNT}, ${NAMESPACE} and ${CLUSTER_NAME} in rbac.yaml, then run:
kubectl apply -f rbac.yaml
```

Then add the content below to spec.tidb (replace ${CLUSTER_NAME}):
```yaml
additionalContainers:
  - command:
      - python3
      - /main.py
    env:
      - name: TLS
        value: "true"
      - name: CLUSTER_NAME
        value: ${CLUSTER_NAME}
      - name: ROLE
        value: tidb
      - name: NODENAME
        valueFrom:
          fieldRef:
            fieldPath: spec.nodeName
    image: pingcap/tidb-gcp-live-migration:v7.1.0
    name: gcp-maintenance-script
```
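The NODENAME environment variable (injected via the downward API) tells the sidecar which node to cordon, and rbac.yaml presumably grants the service account permission to patch nodes and delete pods. A rough sketch of the tidb-role action, using the kubernetes Python client (not the actual script; pod and namespace discovery are assumptions):

```python
# Sketch of the tidb-role action: cordon the node, then delete this TiDB pod
# so it is rescheduled onto another node (assumes in-cluster RBAC allows both).
import os
from kubernetes import client, config


def cordon_and_evict():
    config.load_incluster_config()
    core = client.CoreV1Api()

    node_name = os.environ["NODENAME"]   # injected via the downward API (spec.nodeName)
    pod_name = os.environ["HOSTNAME"]    # the pod name is the default hostname
    namespace = open(
        "/var/run/secrets/kubernetes.io/serviceaccount/namespace"
    ).read().strip()

    # Cordon: mark the node unschedulable; the autoscaler is expected to reclaim it later.
    core.patch_node(node_name, {"spec": {"unschedulable": True}})

    # Delete the TiDB pod; the controller recreates it on a schedulable node.
    core.delete_namespaced_pod(pod_name, namespace)


if __name__ == "__main__":
    cordon_and_evict()
```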
For TiKV, add the content below to spec.tikv (replace ${CLUSTER_NAME}):
```yaml
additionalVolumes:
  - name: pd-tls
    secret:
      secretName: ${CLUSTER_NAME}-pd-cluster-secret
additionalContainers:
  - command:
      - python3
      - /main.py
    env:
      - name: TLS
        value: "true"
      - name: CLUSTER_NAME
        value: ${CLUSTER_NAME}
      - name: ROLE
        value: tikv
    image: pingcap/tidb-gcp-live-migration:v7.1.0
    name: gcp-maintenance-script
    volumeMounts:
      - name: pd-tls
        mountPath: /var/lib/pd-tls
      - name: tikv-tls
        mountPath: /var/lib/tikv-tls
```
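The mounted secrets give the sidecar client certificates for calling the PD API over TLS. Conceptually, the tikv-role action adds an evict-leader scheduler for this store, roughly as in the sketch below (the PD service name, certificate file names, and store lookup are illustrative assumptions, not the actual script):

```python
# Sketch of the tikv-role action: evict Region leaders from this store by adding
# an evict-leader scheduler through the PD API.
import os
import socket
import requests

CLUSTER = os.environ["CLUSTER_NAME"]
TLS = os.environ.get("TLS", "false").lower() == "true"

if TLS:
    PD = f"https://{CLUSTER}-pd:2379"
    # Client cert/key and CA from the mounted secrets (file names assumed).
    KW = dict(cert=("/var/lib/tikv-tls/tls.crt", "/var/lib/tikv-tls/tls.key"),
              verify="/var/lib/pd-tls/ca.crt")
else:
    PD = f"http://{CLUSTER}-pd:2379"
    KW = {}


def evict_leaders():
    # Find this TiKV instance's store id: the store address starts with the pod name.
    stores = requests.get(f"{PD}/pd/api/v1/stores", **KW).json()["stores"]
    pod_name = socket.gethostname()
    store_id = next(s["store"]["id"] for s in stores
                    if s["store"]["address"].split(".")[0] == pod_name)

    # Adding an evict-leader scheduler moves all Region leaders off this store.
    requests.post(f"{PD}/pd/api/v1/schedulers",
                  json={"name": "evict-leader-scheduler", "store_id": store_id},
                  **KW)


if __name__ == "__main__":
    evict_leaders()
```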
For PD, add the content below to spec.pd (replace ${CLUSTER_NAME}):
```yaml
additionalContainers:
  - command:
      - python3
      - /main.py
    env:
      - name: TLS
        value: "true"
      - name: CLUSTER_NAME
        value: ${CLUSTER_NAME}
      - name: ROLE
        value: PD
    image: pingcap/tidb-gcp-live-migration:v7.1.0
    name: gcp-maintenance-script
    volumeMounts:
      - name: pd-tls
        mountPath: /var/lib/pd-tls
```
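Conceptually, the pd-role action checks whether the local member currently holds PD leadership and, if so, asks it to resign so another member takes over before the migration. A sketch against the PD HTTP API (the local endpoint, certificate file names, and name matching are illustrative assumptions):

```python
# Sketch of the pd-role action: resign PD leadership if this pod is the current leader.
import os
import socket
import requests

TLS = os.environ.get("TLS", "false").lower() == "true"
SCHEME = "https" if TLS else "http"
# The sidecar shares the pod network namespace with PD, so it can use the local port.
PD = f"{SCHEME}://127.0.0.1:2379"
KW = (dict(cert=("/var/lib/pd-tls/tls.crt", "/var/lib/pd-tls/tls.key"),
           verify="/var/lib/pd-tls/ca.crt")
      if TLS else {})


def resign_if_leader():
    leader = requests.get(f"{PD}/pd/api/v1/leader", **KW).json()
    # PD member names follow the pod name, e.g. ${CLUSTER_NAME}-pd-0.
    if leader.get("name") == socket.gethostname():
        # Step down; another PD member is elected leader.
        requests.post(f"{PD}/pd/api/v1/leader/resign", **KW)


if __name__ == "__main__":
    resign_if_leader()
```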
If TLS is not enabled between the cluster components, use the same layout with TLS set to "false" and without the TLS volumes.

For TiDB, first create the RBAC resources the sidecar needs:

```bash
# replace ${SERVICEACCOUNT}, ${NAMESPACE} and ${CLUSTER_NAME} in rbac.yaml, then run:
kubectl apply -f rbac.yaml
```

Then add the content below to spec.tidb (replace ${CLUSTER_NAME}):
```yaml
additionalContainers:
  - command:
      - python3
      - /main.py
    env:
      - name: TLS
        value: "false"
      - name: CLUSTER_NAME
        value: ${CLUSTER_NAME}
      - name: ROLE
        value: tidb
      - name: NODENAME
        valueFrom:
          fieldRef:
            fieldPath: spec.nodeName
    image: pingcap/tidb-gcp-live-migration:v7.1.0
    name: gcp-maintenance-script
```
For TiKV, add the content below to spec.tikv (replace ${CLUSTER_NAME}):
```yaml
additionalContainers:
  - command:
      - python3
      - /main.py
    env:
      - name: TLS
        value: "false"
      - name: CLUSTER_NAME
        value: ${CLUSTER_NAME}
      - name: ROLE
        value: tikv
    image: pingcap/tidb-gcp-live-migration:v7.1.0
    name: gcp-maintenance-script
```
For PD, add the content below to spec.pd (replace ${CLUSTER_NAME}):
```yaml
additionalContainers:
  - command:
      - python3
      - /main.py
    env:
      - name: TLS
        value: "false"
      - name: CLUSTER_NAME
        value: ${CLUSTER_NAME}
      - name: ROLE
        value: PD
    image: pingcap/tidb-gcp-live-migration:v7.1.0
    name: gcp-maintenance-script
```
After the cluster is deployed, increase the PD leader-schedule limit through SQL so that leader transfers triggered during maintenance complete quickly:

```sql
SET CONFIG pd `leader-schedule-limit` = 100;
```
To test the setup, see https://cloud.google.com/compute/docs/instances/simulating-host-maintenance for how to simulate a host maintenance event, and use a script to simulate the effect of a live migration (a sketch follows).
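For example, a small wrapper around the gcloud command from the page above can trigger a simulated live migration on the GCE instance backing a GKE node (instance name and zone are placeholders you supply):

```python
# Sketch: simulate a host-maintenance (live migration) event on a GCE instance
# by wrapping the gcloud command documented at the link above.
import subprocess
import sys


def simulate_maintenance(instance, zone):
    subprocess.run(
        ["gcloud", "compute", "instances", "simulate-maintenance-event",
         instance, "--zone", zone],
        check=True,
    )


if __name__ == "__main__":
    # Usage: python3 simulate.py <gke-node-instance-name> <zone>
    simulate_maintenance(sys.argv[1], sys.argv[2])
```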