Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support restart of service pods (#103) #2377

Merged
merged 1 commit into from
Nov 28, 2023
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
@@ -0,0 +1,195 @@
# Restart service Pods in a NebulaGraph cluster on K8s

!!! note

Restarting NebulaGraph cluster service Pods is a feature in the Alpha version.

During routine maintenance, it might be necessary to restart a specific service Pod in the NebulaGraph cluster, for instance, when the Pod's status is abnormal or to enforce a restart. Restarting a Pod essentially means restarting the service process. To ensure high availability, NebulaGraph Operator supports gracefully restarting all Pods of the Graph, Meta, or Storage service respectively and gracefully restarting an individual Pod of the Storage service.

## Prerequisites

A NebulaGraph cluster is created in a K8s environment. For details, see [Create a NebulaGraph cluster](../4.1.installation/4.1.1.cluster-install.md).

## Restart all Pods of a certain service type

To gracefully roll restart all Pods of a certain service type in the cluster, you can add an annotation (`nebula-graph.io/restart-timestamp`) with the current time to the configuration of the StatefulSet controller of the corresponding service.

When NebulaGraph Operator detects that the StatefulSet controller of the corresponding service has the annotation `nebula-graph.io/restart-timestamp` and its value is changed, it triggers the graceful rolling restart operation for all Pods of that service type in the cluster.

In the following example, the annotation is added for all Graph services so that all Pods of these Graph services are restarted one by one.

Assume that the cluster name is `nebula` and the cluster resources are in the `default` namespace. Run the following command:

1. Check the name of the StatefulSet controller.

```bash
kubectl get statefulset
```

Sample output:

```bash
NAME READY AGE
nebula-graphd 2/2 33s
nebula-metad 3/3 69s
nebula-storaged 3/3 69s
```

2. Get the current timestamp.

```bash
date -u +%s
```

Example output:

```bash
1700547115
```

3. Overwrite the timestamp annotation of the StatefulSet controller to trigger the graceful rolling restart operation.

```bash
kubectl annotate statefulset nebula-graphd nebula-graph.io/restart-timestamp="1700547115" --overwrite
```

Example output:

```bash
statefulset.apps/nebula-graphd annotate
```

4. Observe the restart process.

```bash
kubectl get pods -l app.kubernetes.io/cluster=nebula,app.kubernetes.io/component=graphd -w
```

Example output:

```bash
NAME READY STATUS RESTARTS AGE
nebula-graphd-0 1/1 Running 0 9m37s
nebula-graphd-1 0/1 Running 0 17s
nebula-graphd-1 1/1 Running 0 20s
nebula-graphd-0 1/1 Terminating 0 9m40s
nebula-graphd-0 0/1 Terminating 0 9m41s
nebula-graphd-0 0/1 Terminating 0 9m42s
nebula-graphd-0 0/1 Terminating 0 9m42s
nebula-graphd-0 0/1 Terminating 0 9m42s
nebula-graphd-0 0/1 Pending 0 0s
nebula-graphd-0 0/1 Pending 0 0s
nebula-graphd-0 0/1 ContainerCreating 0 0s
nebula-graphd-0 0/1 Running 0 2s
```

This above output shows the status of Graph service Pods during the restart process.

5. Verify that the StatefulSet controller annotation is updated.

```bash
kubectl get statefulset nebula-graphd -o yaml | grep "nebula-graph.io/restart-timestamp"

```

Example output:

```yaml
nebula-graph.io/last-applied-configuration: '{"persistentVolumeClaimRetentionPolicy":{"whenDeleted":"Retain","whenScaled":"Retain"},"podManagementPolicy":"Parallel","replicas":2,"revisionHistoryLimit":10,"selector":{"matchLabels":{"app.kubernetes.io/cluster":"nebula","app.kubernetes.io/component":"graphd","app.kubernetes.io/managed-by":"nebula-operator","app.kubernetes.io/name":"nebula-graph"}},"serviceName":"nebula-graphd-headless","template":{"metadata":{"annotations":{"nebula-graph.io/cm-hash":"7c55c0e5ac74e85f","nebula-graph.io/restart-timestamp":"1700547815"},"creationTimestamp":null,"labels":{"app.kubernetes.io/cluster":"nebula","app.kubernetes.io/component":"graphd","app.kubernetes.io/managed-by":"nebula-operator","app.kubernetes.io/name":"nebula-graph"}},"spec":{"containers":[{"command":["/bin/sh","-ecx","exec
nebula-graph.io/restart-timestamp: "1700547115"
nebula-graph.io/restart-timestamp: "1700547815"
```

The above output indicates that the annotation of the StatefulSet controller has been updated, and all graph service Pods has been restarted.


## Restart a single Storage service Pod

To gracefully roll restart a single Storage service Pod, you can add an annotation (`nebula-graph.io/restart-ordinal`) with the value set to the ordinal number of the Storage service Pod you want to restart. This triggers a graceful restart or state transition for that specific Storage service Pod. The added annotation will be automatically removed after the Storage service Pod is restarted.

In the following example, the annotation is added for the Pod with ordinal number `1`, indicating a graceful restart for the `nebula-storaged-1` Storage service Pod.

Assume that the cluster name is `nebula`, and the cluster resources are in the `default` namespace. Run the following commands:

1. Check the name of the StatefulSet controller.

```bash
kubectl get statefulset
```

Example output:

```bash
NAME READY AGE
nebula-graphd 2/2 33s
nebula-metad 3/3 69s
nebula-storaged 3/3 69s
```

2. Get the ordinal number of the Storage service Pod.

```bash
kubectl get pods -l app.kubernetes.io/cluster=nebula,app.kubernetes.io/component=storaged
```

Example output:

```bash
NAME READY STATUS RESTARTS AGE
nebula-storaged-0 1/1 Running 0 13h
nebula-storaged-1 1/1 Running 0 13h
nebula-storaged-2 1/1 Running 0 13h
nebula-storaged-3 1/1 Running 0 13h
nebula-storaged-4 1/1 Running 0 13h
nebula-storaged-5 1/1 Running 0 13h
nebula-storaged-6 1/1 Running 0 13h
nebula-storaged-7 1/1 Running 0 13h
nebula-storaged-8 1/1 Running 0 13h
```

3. Add the annotation for the `nebula-storaged-1` pod to trigger a graceful restart for that specific pod.

```bash
kubectl annotate statefulset nebula-storaged nebula-graph.io/restart-ordinal="1"
```

Example output:

```bash
statefulset.apps/nebula-storaged annotate
```

4. Observe the restart process.

```bash
kubectl get pods -l app.kubernetes.io/cluster=nebula,app.kubernetes.io/component=storaged -w
```

Example output:

```bash
NAME READY STATUS RESTARTS AGE
nebula-storaged-0 1/1 Running 0 13h
nebula-storaged-1 1/1 Running 0 13h
nebula-storaged-2 1/1 Running 0 13h
nebula-storaged-3 1/1 Running 0 13h
nebula-storaged-4 1/1 Running 0 13h
nebula-storaged-5 1/1 Running 0 12h
nebula-storaged-6 1/1 Running 0 12h
nebula-storaged-7 1/1 Running 0 12h
nebula-storaged-8 1/1 Running 0 12h


nebula-storaged-1 1/1 Running 0 13h
nebula-storaged-1 1/1 Terminating 0 13h
nebula-storaged-1 0/1 Terminating 0 13h
nebula-storaged-1 0/1 Terminating 0 13h
nebula-storaged-1 0/1 Terminating 0 13h
nebula-storaged-1 0/1 Terminating 0 13h
nebula-storaged-1 0/1 Pending 0 0s
nebula-storaged-1 0/1 Pending 0 0s
nebula-storaged-1 0/1 ContainerCreating 0 0s
nebula-storaged-1 0/1 Running 0 1s
nebula-storaged-1 1/1 Running 0 10s

The above output indicates that the `nebula-storaged-1` Storage service Pod has been successfully restarted.
Original file line number Diff line number Diff line change
@@ -0,0 +1,195 @@
# 重启 K8s 上的{{nebula.name}}集群服务

!!! note

重启集群服务 Pod 功能为 Alpha 版本功能。

在日常维护时,出于各种原因需要重启{{nebula.name}}集群的某个服务 Pod,例如 Pod 状态异常或是执行强行重启逻辑。Pod 重启的本质是重启服务进程,为了确保服务的高可用性,NebulaGraph Operator 支持优雅滚动重启集群内所有 Graph,Meta,或 Storage 服务 Pod,也支持优雅重启单个 Storage 服务 Pod。

## 前提条件

已经在 K8s 环境中创建了一个{{nebula.name}}集群。具体步骤,参见[创建{{nebula.name}}集群](../4.1.installation/4.1.1.cluster-install.md)。

## 优雅滚动重启集群内某类服务的所有 Pod

通过在不同服务 StatefulSet 控制器的配置中添加注解(annotation)`nebula-graph.io/restart-timestamp`并将值设置为当前时间来实现优雅滚动重启集群同类服务 Pod。当 NebulaGraph Operator 检测到相应服务的 StatefulSet 控制器存在注解`nebula-graph.io/restart-timestamp`并且其值发生了变更,即会触发优雅滚动重启集群内某类服务的所有 Pod 的操作。

以下示例中,为所有 Graph 服务都设置注解,表示将逐个重启所有 Graph 服务 Pod。

假设所有集群名为`nebula`,集群资源都放在`default`命名空间下,执行以下命令:


1. 查看 StatefulSet 控制器的名称。

```bash
kubectl get statefulset
```

示例输出:

```bash
NAME READY AGE
nebula-graphd 2/2 33s
nebula-metad 3/3 69s
nebula-storaged 3/3 69s
```

2. 获取当前时间戳。

```bash
date -u +%s
```
示例输出:

```bash
1700547115
```

3. 覆盖 StatefulSet 控制器时间戳注解以触发优雅滚动重启操作。

```bash
kubectl annotate statefulset nebula-graphd nebula-graph.io/restart-timestamp="1700547115" --overwrite
```

示例输出:

```bash
statefulset.apps/nebula-graphd annotate
```

4. 观察重启过程。

```bash
kubectl get pods -l app.kubernetes.io/cluster=nebula,app.kubernetes.io/component=graphd -w
```

示例输出:

```bash
NAME READY STATUS RESTARTS AGE
nebula-graphd-0 1/1 Running 0 9m37s
nebula-graphd-1 0/1 Running 0 17s
nebula-graphd-1 1/1 Running 0 20s
nebula-graphd-0 1/1 Terminating 0 9m40s
nebula-graphd-0 0/1 Terminating 0 9m41s
nebula-graphd-0 0/1 Terminating 0 9m42s
nebula-graphd-0 0/1 Terminating 0 9m42s
nebula-graphd-0 0/1 Terminating 0 9m42s
nebula-graphd-0 0/1 Pending 0 0s
nebula-graphd-0 0/1 Pending 0 0s
nebula-graphd-0 0/1 ContainerCreating 0 0s
nebula-graphd-0 0/1 Running 0 2s
```

上述输出显示所有 Graph 服务 Pod 重启过程。

5. 确认 StatefulSet 控制器注解更新。

```bash
kubectl get statefulset nebula-graphd -o yaml | grep "nebula-graph.io/restart-timestamp"
```

示例输出:

```yaml
nebula-graph.io/last-applied-configuration: '{"persistentVolumeClaimRetentionPolicy":{"whenDeleted":"Retain","whenScaled":"Retain"},"podManagementPolicy":"Parallel","replicas":2,"revisionHistoryLimit":10,"selector":{"matchLabels":{"app.kubernetes.io/cluster":"nebula","app.kubernetes.io/component":"graphd","app.kubernetes.io/managed-by":"nebula-operator","app.kubernetes.io/name":"nebula-graph"}},"serviceName":"nebula-graphd-headless","template":{"metadata":{"annotations":{"nebula-graph.io/cm-hash":"7c55c0e5ac74e85f","nebula-graph.io/restart-timestamp":"1700547815"},"creationTimestamp":null,"labels":{"app.kubernetes.io/cluster":"nebula","app.kubernetes.io/component":"graphd","app.kubernetes.io/managed-by":"nebula-operator","app.kubernetes.io/name":"nebula-graph"}},"spec":{"containers":[{"command":["/bin/sh","-ecx","exec
nebula-graph.io/restart-timestamp: "1700547115"
nebula-graph.io/restart-timestamp: "1700547815"
```

由上述输出可知,StatefulSet 控制器的注解已经更新并且 Graph 服务的所有 Pod 已经重启。


## 优雅滚动重启单个 Storage 服务 Pod

通过在 Storage 服务的 StatefulSet 控制器的配置中添加注解(annotation)`nebula-graph.io/restart-ordinal`并将值设置为 Storage 服务 Pod 的序号来实现优雅滚动重启单个 Storage 服务 Pod,即执行状态转移操作。在 Storage 服务 Pod 重启后,添加的注解会被删除。

以下示例为序号为`1`的 Storage Pod 添加注解,表示将优雅重启名为`nebula-storaged-1`的 Storage 服务 Pod。

假设所有集群名为`nebula`,集群资源都放在`default`命名空间下,执行以下命令:

1. 查看 StatefulSet 控制器名称。

```bash
kubectl get statefulset
```

示例输出:

```bash
NAME READY AGE
nebula-graphd 2/2 33s
nebula-metad 3/3 69s
nebula-storaged 3/3 69s
```

2. 获取 Storage 服务 Pod 的序号。

```bash
kubectl get pods -l app.kubernetes.io/cluster=nebula,app.kubernetes.io/component=storaged
```

示例输出:

```bash
NAME READY STATUS RESTARTS AGE
nebula-storaged-0 1/1 Running 0 13h
nebula-storaged-1 1/1 Running 0 13h
nebula-storaged-2 1/1 Running 0 13h
nebula-storaged-3 1/1 Running 0 13h
nebula-storaged-4 1/1 Running 0 13h
nebula-storaged-5 1/1 Running 0 13h
nebula-storaged-6 1/1 Running 0 13h
nebula-storaged-7 1/1 Running 0 13h
nebula-storaged-8 1/1 Running 0 13h
```

3. 为`nebula-storaged-1` Pod 添加注解以触发优雅滚动重启该 Pod 操作。

```bash
kubectl annotate statefulset nebula-storaged nebula-graph.io/restart-ordinal="1"
```

示例输出:

```bash
statefulset.apps/nebula-storaged annotate
```

4. 观察重启过程。

```bash
kubectl get pods -l app.kubernetes.io/cluster=nebula,app.kubernetes.io/component=storaged -w
```

示例输出:

```bash
NAME READY STATUS RESTARTS AGE
nebula-storaged-0 1/1 Running 0 13h
nebula-storaged-1 1/1 Running 0 13h
nebula-storaged-2 1/1 Running 0 13h
nebula-storaged-3 1/1 Running 0 13h
nebula-storaged-4 1/1 Running 0 13h
nebula-storaged-5 1/1 Running 0 12h
nebula-storaged-6 1/1 Running 0 12h
nebula-storaged-7 1/1 Running 0 12h
nebula-storaged-8 1/1 Running 0 12h


nebula-storaged-1 1/1 Running 0 13h
nebula-storaged-1 1/1 Terminating 0 13h
nebula-storaged-1 0/1 Terminating 0 13h
nebula-storaged-1 0/1 Terminating 0 13h
nebula-storaged-1 0/1 Terminating 0 13h
nebula-storaged-1 0/1 Terminating 0 13h
nebula-storaged-1 0/1 Pending 0 0s
nebula-storaged-1 0/1 Pending 0 0s
nebula-storaged-1 0/1 ContainerCreating 0 0s
nebula-storaged-1 0/1 Running 0 1s
nebula-storaged-1 1/1 Running 0 10s

由上述输出可知,`nebula-storaged-1` Storage 服务 Pod 已经重启。