doc(op): monitor with prometheus (#1387)

Signed-off-by: Jiyong Huang <huangjy@emqx.io>
lf-edge · Aug 31, 2022 · 3f21ccc
1 parent 9b51f7b
commit 3f21ccc

Showing 5 changed files with 266 additions and 0 deletions.
diff --git a/docs/directory.json b/docs/directory.json
@@ -131,6 +131,10 @@
 						{
 							"title": "使用 Protobuf 编解码教程",
 							"path": "tutorials/usage/protobuf_tutorial"
+						},
+						{
+							"title": "使用 Prometheus 监控规则运行状态",
+							"path": "tutorials/usage/monitor_with_prometheus"
 						}
 					]
 				}
@@ -677,6 +681,10 @@
 						{
 							"title": "Protobuf Codec Tutorial",
 							"path": "tutorials/usage/protobuf_tutorial"
+						},
+						{
+							"title": "Monitor rule status with Prometheus",
+							"path": "tutorials/usage/monitor_with_prometheus"
 						}
 					]
 				}

diff --git a/docs/en_US/tutorials/usage/monitor_with_prometheus.md b/docs/en_US/tutorials/usage/monitor_with_prometheus.md
@@ -0,0 +1,128 @@
+# Monitor rule status with Prometheus
+
+Prometheus is an open source system monitoring and alerting toolkit hosted at CNCF, and has been adopted by many companies and organizations as a monitoring and alerting tool.
+
+eKuiper rules are continuously running streaming tasks. A rule processes unbounded streams of data; under normal circumstances, once started it runs continuously and produces operational status data until it is stopped manually or hits an unrecoverable error. eKuiper provides a status API to get the running metrics of a rule. eKuiper also integrates with Prometheus, which makes it easy to monitor these status metrics. This tutorial is intended for users who are already familiar with eKuiper. It introduces the rule status metrics and shows how to monitor specific metrics via Prometheus.
+
+## Rule Status Metrics
+
+Once a rule has been created and is running in eKuiper, you can view its operational status metrics via the CLI, the REST API or the management console. For example, for an existing rule named `rule1`, you can get the run metrics in JSON format via `curl -X GET "http://127.0.0.1:9081/rules/rule1/status"`.
+
+```json
+{
+  "status": "running",
+  "source_demo_0_records_in_total": 265,
+  "source_demo_0_records_out_total": 265,
+  "source_demo_0_process_latency_us": 0,
+  "source_demo_0_buffer_length": 0,
+  "source_demo_0_last_invocation": "2022-08-22T17:19:10.979128",
+  "source_demo_0_exceptions_total": 0,
+  "source_demo_0_last_exception": "",
+  "source_demo_0_last_exception_time": 0,
+  "op_2_project_0_records_in_total": 265,
+  "op_2_project_0_records_out_total": 265,
+  "op_2_project_0_process_latency_us": 0,
+  "op_2_project_0_buffer_length": 0,
+  "op_2_project_0_last_invocation": "2022-08-22T17:19:10.979128",
+  "op_2_project_0_exceptions_total": 0,
+  "op_2_project_0_last_exception": "",
+  "op_2_project_0_last_exception_time": 0,
+  "sink_mqtt_0_0_records_in_total": 265,
+  "sink_mqtt_0_0_records_out_total": 265,
+  "sink_mqtt_0_0_process_latency_us": 0,
+  "sink_mqtt_0_0_buffer_length": 0,
+  "sink_mqtt_0_0_last_invocation": "2022-08-22T17:19:10.979128",
+  "sink_mqtt_0_0_exceptions_total": 0,
+  "sink_mqtt_0_0_last_exception": "",
+  "sink_mqtt_0_0_last_exception_time": 0
+}
+```
+
+The rule status consists of two main parts. The first is `status`, which indicates whether the rule is running properly; its value may be `running`, `stopped manually`, etc. The second is the set of metrics for each operator of the rule. The operators are generated from the rule's SQL, so they may differ from rule to rule. In this example, the rule SQL is the simplest `SELECT * FROM demo`, the action is MQTT, and the generated operators are [source_demo, op_project, sink_mqtt]. Every operator exposes the same kinds of metrics, and the full name of a metric is the operator name followed by the metric name. For example, the `records_in_total` metric of the operator `source_demo_0` is `source_demo_0_records_in_total`.
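The naming scheme described above (operator name plus metric name) can be undone in a few lines. The following is a minimal sketch, not part of eKuiper: `group_by_operator` and `METRIC_SUFFIXES` are hypothetical names, and the sample is an abbreviated copy of the payload shown earlier. It assumes the metric names listed in this tutorial are the complete set of suffixes.

```python
import json
from collections import defaultdict

# Metric names taken from this tutorial; the operator name is whatever precedes them.
METRIC_SUFFIXES = (
    "records_in_total",
    "records_out_total",
    "process_latency_us",
    "buffer_length",
    "last_invocation",
    "exceptions_total",
    "last_exception_time",
    "last_exception",
)


def group_by_operator(status: dict) -> dict:
    """Split flat keys such as 'source_demo_0_records_in_total' into
    {'source_demo_0': {'records_in_total': ...}}.
    Keys that match no known suffix (for example 'status') are skipped."""
    grouped = defaultdict(dict)
    for key, value in status.items():
        for suffix in METRIC_SUFFIXES:
            if key.endswith("_" + suffix):
                operator = key[: -(len(suffix) + 1)]
                grouped[operator][suffix] = value
                break
    return dict(grouped)


# Abbreviated sample of the status payload shown above.
sample = json.loads("""
{
  "status": "running",
  "source_demo_0_records_in_total": 265,
  "source_demo_0_exceptions_total": 0,
  "sink_mqtt_0_0_records_in_total": 265,
  "sink_mqtt_0_0_last_exception_time": 0
}
""")

operators = group_by_operator(sample)
print(operators)
```

Grouping by operator makes it easier to compare the same metric across the source, the intermediate operators and the sink.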
+
+### Metric Types
+
+The metrics are the same for each operator and are mainly the following:
+
+- records_in_total: the total number of messages read in, indicating how many messages have been processed since the rule was started.
+- records_out_total: total number of messages output, indicating the number of messages processed by the operator **correctly**.
+- process_latency_us: latency of the most recent processing in microseconds. The value is instantaneous and gives an idea of the processing performance of the operator. The latency of the overall rule is generally determined by the operator with the largest latency.
+- buffer_length: the length of the buffer. Since there is a difference in computation speed between operators, there is a buffer queue between each operator. A larger buffer length means the processing is slower and cannot catch up with the upstream processing speed.
+- last_invocation: the time of the last run of the operator.
+- exceptions_total: the total number of exceptions. Recoverable errors generated while the operator is running, such as broken connections or data format errors, are counted as exceptions and do not stop the rule.
+
+After version 1.6.1, we added two more exception-related metrics to facilitate the debugging of exceptions.
+
+- last_exception: the error message of the last exception.
+- last_exception_time: the time of the last exception.
+
+All of the numeric metrics above can be monitored using Prometheus. The next section describes how to configure the Prometheus service in eKuiper.
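The metric descriptions above suggest a few simple checks: operators with a non-zero `exceptions_total`, operators whose `buffer_length` is large, and the operator with the highest `process_latency_us`. The sketch below illustrates them under stated assumptions: `summarize` is a hypothetical helper, the `max_buffer` threshold of 100 is arbitrary, and the sample values are made up for illustration rather than measured.

```python
def summarize(operators: dict, max_buffer: int = 100) -> dict:
    """Report operators with exceptions, operators whose buffer exceeds
    max_buffer (an arbitrary, hypothetical threshold), and the operator
    with the highest most-recent processing latency."""
    with_exceptions = sorted(
        name for name, m in operators.items() if m.get("exceptions_total", 0) > 0
    )
    backlogged = sorted(
        name for name, m in operators.items() if m.get("buffer_length", 0) > max_buffer
    )
    slowest = max(
        operators,
        key=lambda name: operators[name].get("process_latency_us", 0),
        default=None,
    )
    return {
        "with_exceptions": with_exceptions,
        "backlogged": backlogged,
        "slowest": slowest,
    }


# Illustrative values only; a real rule would report its own numbers.
sample_operators = {
    "source_demo_0": {"exceptions_total": 0, "buffer_length": 0, "process_latency_us": 5},
    "op_2_project_0": {"exceptions_total": 0, "buffer_length": 250, "process_latency_us": 40},
    "sink_mqtt_0_0": {"exceptions_total": 3, "buffer_length": 10, "process_latency_us": 12},
}

report = summarize(sample_operators)
print(report)
```

Because `process_latency_us` is an instantaneous value, a single reading is only a hint; sampling it over time gives a more reliable picture.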
+
+## Configuring the Prometheus Service in eKuiper
+
+The Prometheus service comes with eKuiper, but is disabled by default. You can turn it on by modifying the configuration in `etc/kuiper.yaml`: `prometheus` is a boolean value, set it to `true` to turn on the service; `prometheusPort` configures the port of the service.
+
+```yaml
+  prometheus: true
+  prometheusPort: 20499
+```
+
+If you start eKuiper with Docker, you can also enable the service by configuring environment variables.
+
+```shell
+docker run -p 9081:9081 -p 20499:20499 -d --name ekuiper -e MQTT_SOURCE__DEFAULT__SERVER="$MQTT_BROKER_ADDRESS" -e KUIPER__BASIC__PROMETHEUS=true lfedge/ekuiper:$tag
+```
+
+In the startup log, you can see information about the service startup, for example:
+
+```text
+time="2022-08-22 17:16:50" level=info msg="Serving prometheus metrics on port http://localhost:20499/metrics" file="server/prome_init.go:60"
+Serving prometheus metrics on port http://localhost:20499/metrics
+```
+
+Open the address `http://localhost:20499/metrics` from the log message to see the raw eKuiper metrics exposed in Prometheus format. Once eKuiper has rules running properly, you can search the page for metrics such as `kuiper_sink_records_in_total`. You can then configure Prometheus to connect to eKuiper for a richer presentation.
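The page at `/metrics` is plain text, one sample per line, so it can also be inspected from a script. Below is a minimal sketch of reading one metric out of that text. It is not a full parser of the exposition format, and the sample lines, including the label names `instance`, `op` and `rule`, are assumptions made for illustration rather than output copied from eKuiper.

```python
import re

# metric_name{label="value",...} value
SAMPLE_LINE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_metric(text: str, name: str) -> list:
    """Return (labels, value) pairs for every sample of the metric `name`.
    Comment lines (starting with '#') and other metrics are ignored."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_LINE.match(line)
        if match is None or match.group(1) != name:
            continue
        labels = dict(LABEL.findall(match.group(2) or ""))
        samples.append((labels, float(match.group(3))))
    return samples


# Hypothetical page content; label names and values are illustrative.
sample_text = """
# hypothetical sample, not captured from a running server
kuiper_sink_records_in_total{instance="0",op="mqtt_0",rule="rule1"} 265
kuiper_sink_records_in_total{instance="0",op="mqtt_0",rule="rule2"} 12
kuiper_source_records_in_total{instance="0",op="demo",rule="rule1"} 265
"""

sink_samples = parse_metric(sample_text, "kuiper_sink_records_in_total")
print(sink_samples)
```

For anything beyond a quick check, a maintained Prometheus client library or the Prometheus server itself is a better choice than a hand-written parser.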
+
+## Using Prometheus to monitor status
+
+The previous section exported eKuiper status as Prometheus metrics. Next, we configure Prometheus to scrape these metrics and complete the monitoring setup.
+
+### Installation and Configuration
+
+Go to the [Prometheus website](https://prometheus.io/download/) to download the version for your platform and then unzip it.
+
+Modify the configuration file so that it monitors eKuiper. Open `prometheus.yml` and modify the `scrape_configs` section as follows.
+
+```yaml
+global:
+  scrape_interval: 15s
+  evaluation_interval: 15s
+
+rule_files:
+# - "first.rules"
+# - "second.rules"
+
+scrape_configs:
+- job_name: ekuiper
+  static_configs:
+    - targets: ['localhost:20499']
+```
+
+This defines a monitoring job named `ekuiper`, with `targets` pointing to the address of the service started in the previous section. After the configuration is done, start Prometheus.
+
+```shell
+./prometheus --config.file=prometheus.yml
+```
+
+After successful startup, open `http://localhost:9090/` to access the management console.
+
+### Simple monitoring
+
+As a simple example, monitor the number of messages received by the sinks of all rules. Enter the name of the metric to be monitored (for example `kuiper_sink_records_in_total`) in the search box as shown in the figure, and click `Execute` to generate the monitoring table. Select `Graph` to switch to a line graph or other display methods.
+
+![Set monitor in prometheus](./resources/prom.png)
+
+Click `Add Panel` to monitor more metrics in the same way.
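The same query can be issued without the web console through the Prometheus HTTP API, whose instant-query endpoint is `/api/v1/query`. The sketch below is a minimal example under stated assumptions: the helper names are hypothetical, the server address is the `http://localhost:9090` used above, and the sample response (including the `rule` label) is illustrative rather than captured from a live server.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_query_url(base_url: str, promql: str) -> str:
    """Build the instant-query URL for a PromQL expression."""
    return base_url.rstrip("/") + "/api/v1/query?" + urlencode({"query": promql})


def extract_values(response: dict) -> list:
    """Turn an instant-query response of type 'vector' into (labels, value) pairs."""
    if response.get("status") != "success":
        raise ValueError("query failed: %r" % response.get("error"))
    data = response["data"]
    if data["resultType"] != "vector":
        raise ValueError("unexpected result type: %s" % data["resultType"])
    # Each value is [unix_timestamp, "value as string"].
    return [(item["metric"], float(item["value"][1])) for item in data["result"]]


def query_prometheus(base_url: str, promql: str, timeout: float = 5.0) -> list:
    """Run the query against a live server (needs Prometheus to be running)."""
    with urlopen(build_query_url(base_url, promql), timeout=timeout) as resp:
        return extract_values(json.load(resp))


url = build_query_url("http://localhost:9090/", "kuiper_sink_records_in_total")
print(url)

# Illustrative response shape; labels and numbers are made up.
sample_response = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {
                "metric": {"__name__": "kuiper_sink_records_in_total", "rule": "rule1"},
                "value": [1661159950.0, "265"],
            }
        ],
    },
}

values = extract_values(sample_response)
print(values)
```

With Prometheus running, `query_prometheus("http://localhost:9090", "kuiper_sink_records_in_total")` would return the current samples in the same shape.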
+
+## Summary
+
+This article introduced the rule metrics in eKuiper and how to use Prometheus to monitor them. Building on this, users can explore more advanced features of Prometheus to improve the operation and maintenance of eKuiper.
diff --git a/docs/en_US/tutorials/usage/resources/prom.png b/docs/en_US/tutorials/usage/resources/prom.png
diff --git a/docs/zh_CN/tutorials/usage/monitor_with_prometheus.md b/docs/zh_CN/tutorials/usage/monitor_with_prometheus.md
@@ -0,0 +1,130 @@
+# 使用 Prometheus 监控规则运行状态
+
+Prometheus 是一个托管于 CNCF 的开源系统监控和警报工具包，许多公司和组织都采用了 Prometheus 作为监控告警工具。
+
+eKuiper 的规则是一个持续运行的流式计算任务。规则用于处理无界的数据流，正常情况下，规则启动后会一直运行，不断产生运行状态数据。直到规则被手动停止或出现不可恢复的错误后停止。eKuiper 中的规则提供了状态 API，可获取规则的运行指标。同时，eKuiper 整合了 Prometheus，可方便地通过后者监控各种状态指标。本教程面向已经初步了解 eKuiper 的用户，将介绍规则状态指标以及如何通过 Prometheus 监控特定的指标。
+
+## 规则状态指标
+
+使用 eKuiper 创建规则并运行成功后，用户可以通过 CLI，REST API 或者管理控制台查看规则的运行状态指标。例如，已有规则 rule1，可通过 `curl -X GET "http://127.0.0.1:9081/rules/rule1/status"` 获取 JSON 格式的规则运行指标，如下所示：
+
+```json
+{
+  "status": "running",
+  "source_demo_0_records_in_total": 265,
+  "source_demo_0_records_out_total": 265,
+  "source_demo_0_process_latency_us": 0,
+  "source_demo_0_buffer_length": 0,
+  "source_demo_0_last_invocation": "2022-08-22T17:19:10.979128",
+  "source_demo_0_exceptions_total": 0,
+  "source_demo_0_last_exception": "",
+  "source_demo_0_last_exception_time": 0,
+  "op_2_project_0_records_in_total": 265,
+  "op_2_project_0_records_out_total": 265,
+  "op_2_project_0_process_latency_us": 0,
+  "op_2_project_0_buffer_length": 0,
+  "op_2_project_0_last_invocation": "2022-08-22T17:19:10.979128",
+  "op_2_project_0_exceptions_total": 0,
+  "op_2_project_0_last_exception": "",
+  "op_2_project_0_last_exception_time": 0,
+  "sink_mqtt_0_0_records_in_total": 265,
+  "sink_mqtt_0_0_records_out_total": 265,
+  "sink_mqtt_0_0_process_latency_us": 0,
+  "sink_mqtt_0_0_buffer_length": 0,
+  "sink_mqtt_0_0_last_invocation": "2022-08-22T17:19:10.979128",
+  "sink_mqtt_0_0_exceptions_total": 0,
+  "sink_mqtt_0_0_last_exception": "",
+  "sink_mqtt_0_0_last_exception_time": 0
+}
+```
+
+运行指标主要包括两个部分，一部分是 status，用于标示规则是否正常运行，其值可能为 `running`, `stopped manually` 等。另一部分为规则每个算子的运行指标。规则的算子根据规则的 SQL 生成，每个规则可能会有所不同。在此例中，规则 SQL 为最简单的 `SELECT * FROM demo`, action 为 MQTT，其生成的算子为 [source_demo, op_project, sink_mqtt] 3个。每一种算子都有相同数目的运行指标，与算子名字合起来构成一条指标。例如，算子 source_demo_0 的输入数量 records_in_total 的指标为 `source_demo_0_records_in_total`。
+
+### 运行指标
+
+每个算子的运行指标是相同的，主要有以下几种：
+
+- records_in_total：读入的消息总量，表示规则启动后处理了多少条消息。
+- records_out_total：输出的消息总量，表示算子**正确**处理的消息数量。
+- process_latency_us：最近一次处理的延时，单位为微秒。该值为瞬时值，可了解算子的处理性能。整体规则的延时一般由延时最大的算子决定。
+- buffer_length：算子缓冲区长度。由于算子之间计算速度会有差异，各个算子之间都有缓冲队列。缓冲区长度较大的话说明算子处理较慢，赶不上上游处理速度。
+- last_invocation：算子的最后一次运行的时间。
+- exceptions_total：异常总量。算子运行中产生的可恢复的错误，例如连接中断，数据格式错误等均计入异常，而不会中断规则。
+
+在 1.6.1 版本以后，我们又添加了两个异常相关指标，方便异常的调试处理。
+
+- last_exception：最近一次的异常的错误信息。
+- last_exception_time：最近一次异常的发生时间。
+
+这些运行指标中的数值类型指标均可使用 Prometheus 进行监控。下一节我们将描述如何配置 eKuiper 中的 Prometheus 服务。
+
+## 配置 eKuiper 的 Prometheus 服务
+
+eKuiper 中自带 Prometheus 服务，但是默认为关闭状态。用户可修改 `etc/kuiper.yaml` 中的配置打开该服务。其中，`prometheus` 为布尔值，修改为 `true` 可打开服务；`prometheusPort` 配置服务的访问端口。
+
+```yaml
+  prometheus: true
+  prometheusPort: 20499
+```
+
+若使用 Docker 启动 eKuiper，也可通过配置环境变量启用服务。
+
+```shell
+docker run -p 9081:9081 -p 20499:20499 -d --name ekuiper -e MQTT_SOURCE__DEFAULT__SERVER="$MQTT_BROKER_ADDRESS" -e KUIPER__BASIC__PROMETHEUS=true lfedge/ekuiper:$tag
+```
+
+在启动的日志中，可以看到服务启动的相关信息，例如:
+
+```text
+time="2022-08-22 17:16:50" level=info msg="Serving prometheus metrics on port http://localhost:20499/metrics" file="server/prome_init.go:60"
+Serving prometheus metrics on port http://localhost:20499/metrics
+```
+
+点击提示中的地址 `http://localhost:20499/metrics` ，可查看到 Prometheus 中搜集到的 eKuiper 的原始指标信息。eKuiper 有规则正常运行之后，可以在页面中搜索到类似 `kuiper_sink_records_in_total` 等的指标。用户可以配置 Prometheus 接入 eKuiper，进行更丰富的展示。
+
+## 使用 Prometheus 查看状态
+
+上文我们已经实现了将 eKuiper 状态输出为 Prometheus 指标的功能，接下来我们可以配置 Prometheus 接入这一部分指标，并完成初步的监控。
+
+### 安装和配置
+
+到 [Prometheus 官方网站](https://prometheus.io/download/) 下载所需要的系统版本然后解压。
+
+修改配置文件，使其监控 eKuiper。打开 `prometheus.yml`，修改 scrape_configs 部分，如下所示：
+
+```yaml
+global:
+  scrape_interval:     15s
+  evaluation_interval: 15s
+
+rule_files:
+  # - "first.rules"
+  # - "second.rules"
+
+scrape_configs:
+  - job_name: ekuiper
+    static_configs:
+      - targets: ['localhost:20499']
+```
+
+此处定义了监控任务名为 `ekuiper`, targets 指向上一节启动的服务的地址。配置完成后，启动 Prometheus 。
+
+```shell
+./prometheus --config.file=prometheus.yml
+```
+
+启动成功后，打开 `http://localhost:9090/` 可进入管理控制台。
+
+### 简单监控
+
+监控所有规则的 sink 接收到的消息数目变化。可以在如图的搜索框中输入需要监控的指标名称，点击 `Execute` 即可生成监控表。选择 `Graph` 可切换为折线图等展示方式。
+
+![set monitor in prometheus](./resources/prom.png)
+
+点击 `Add Panel`，通过同样的配置方式，可监控更多的指标。
+
+## 总结
+
+本文介绍了 eKuiper 中的规则状态指标以及如何使用 Prometheus 简单地监控这些状态指标。用户朋友可以基于此进一步探索 Prometheus 的更多高级功能，更好地实现 eKuiper 的运维。
+
+
diff --git a/docs/zh_CN/tutorials/usage/resources/prom.png b/docs/zh_CN/tutorials/usage/resources/prom.png