This repository has been archived by the owner on Jun 6, 2024. It is now read-only.

add cluster-utilization report doc #5331

Merged
merged 3 commits into from
Mar 2, 2021
@@ -225,7 +225,7 @@ authentication:
# smtp-auth-username: alert-sender@example.com
# smtp-auth-password: password-for-alert-sender
# cluster-utilization: # cluster-utilization is a k8s CronJob which reports the GPU utilization of the cluster
- # # for schedule syntex, refer to https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/#cron-schedule-syntax
+ # # for schedule syntax, refer to https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/#cron-schedule-syntax
# schedule: "0 0 * * *" # daily report at UTC 00:00
# customized-routes:
# routes:
@@ -85,7 +85,7 @@ rest-server:
# smtp-auth-username: alert-sender@example.com
# smtp-auth-password: password-for-alert-sender
# cluster-utilization: # cluster-utilization is a k8s CronJob which reports the GPU utilization of the cluster
- # # for schedule syntex, refer to https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/#cron-schedule-syntax
+ # # for schedule syntax, refer to https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/#cron-schedule-syntax
# schedule: "0 0 * * *" # daily report at UTC 00:00
# customized-routes:
# routes:
31 changes: 31 additions & 0 deletions docs/manual/cluster-admin/how-to-use-alert-system.md
@@ -248,3 +248,34 @@ Remember to re-build and push the docker image, and restart the `alert-manager`
./paictl.py config push -p /cluster-configuration -m service
./paictl.py service start -n alert-manager
```

## Cluster GPU Utilization Report

The alert system can send a cluster GPU utilization report to admin users on a regular schedule.

The report includes statistics for:
- Cluster GPU utilization
- User GPU utilization
- Job GPU utilization

To enable this feature, configure the `alert-manager` field in `services-configuration.yml`.
Two fields are required: `pai-bearer-token` and `cluster-utilization` -> `schedule`.
For the syntax of `schedule`, refer to [Cron Schedule Syntax](https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/#cron-schedule-syntax).
For example, `"0 0 * * *"` sends a daily report at UTC 00:00.
Also make sure that the [`email-admin`](#Existing-Actions-and-Matching-Rules) action is enabled.

```yaml
alert-manager:
  pai-bearer-token: 'your-application-token-for-pai-rest-server'
  cluster-utilization: # cluster-utilization is a k8s CronJob which reports the GPU utilization of the cluster
    # for schedule syntax, refer to https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/#cron-schedule-syntax
    schedule: "0 0 * * *" # daily report at UTC 00:00
```
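
As a hedged illustration of other `schedule` values, the sketch below uses the same fields with a weekly schedule and lists two alternatives as comments. The schedules themselves are examples and are not taken from the original configuration. Times are assumed to be evaluated in UTC, as in the daily example above.

```yaml
alert-manager:
  pai-bearer-token: 'your-application-token-for-pai-rest-server'
  cluster-utilization:
    # Illustrative schedules only; keep exactly one `schedule` line active.
    schedule: "0 8 * * 1"      # weekly report, every Monday at 08:00
    # schedule: "0 */6 * * *"  # every 6 hours, on the hour
    # schedule: "0 0 1 * *"    # monthly report, at 00:00 on the 1st day of the month
```

A less frequent schedule means fewer emails to admins, at the cost of a longer gap between reports.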

To apply the configuration, restart the `alert-manager` service by running the following commands in the dev-box container:

```bash
./paictl.py service stop -n alert-manager
./paictl.py config push -p /cluster-configuration -m service
./paictl.py service start -n alert-manager
```
31 changes: 31 additions & 0 deletions docs_zh_CN/manual/cluster-admin/how-to-use-alert-system.md
@@ -232,3 +232,34 @@ alert-manager:
./paictl.py config push -p /cluster-configuration -m service
./paictl.py service start -n alert-manager
```

## Cluster GPU Utilization Report

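The alert system can send a cluster GPU utilization report to admin users on a regular schedule.

The report includes statistics for:
- Cluster GPU utilization
- User GPU utilization
- Job GPU utilization

To enable this feature, configure the `alert-manager` field in `services-configuration.yml`.
Two fields are required: `pai-bearer-token` and `cluster-utilization` -> `schedule`.
For the syntax of the `schedule` field, refer to [Cron Schedule Syntax](https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/#cron-schedule-syntax).
For example, `"0 0 * * *"` sends a daily report at UTC 00:00.
Also make sure that the [`email-admin`](#Existing-Actions-and-Matching-Rules) action is enabled.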

```yaml
alert-manager:
  pai-bearer-token: 'your-application-token-for-pai-rest-server'
  cluster-utilization: # cluster-utilization is a k8s CronJob which reports the GPU utilization of the cluster
    # for schedule syntax, refer to https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/#cron-schedule-syntax
    schedule: "0 0 * * *" # daily report at UTC 00:00
```
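
For readers unfamiliar with cron expressions, the annotated sketch below spells out the five fields of the `schedule` string, following the standard cron layout that Kubernetes CronJobs use. The field comments are explanatory additions and are not part of the original configuration.

```yaml
# schedule: "<minute> <hour> <day-of-month> <month> <day-of-week>"
#   minute        0-59
#   hour          0-23
#   day-of-month  1-31
#   month         1-12
#   day-of-week   0-6 (Sunday is 0)
# A `*` in a field matches every value of that field.
alert-manager:
  pai-bearer-token: 'your-application-token-for-pai-rest-server'
  cluster-utilization:
    schedule: "0 0 * * *" # minute 0 of hour 0, on every day of every month
```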

To apply the configuration, restart the `alert-manager` service by running the following commands in the dev-box container:

```bash
./paictl.py service stop -n alert-manager
./paictl.py config push -p /cluster-configuration -m service
./paictl.py service start -n alert-manager
```
2 changes: 1 addition & 1 deletion examples/cluster-configuration/services-configuration.yaml
@@ -118,7 +118,7 @@ rest-server:
# smtp-auth-username: alert-sender@example.com
# smtp-auth-password: password-for-alert-sender
# cluster-utilization: # cluster-utilization is a k8s CronJob which reports the GPU utilization of the cluster
- # # for schedule syntex, refer to https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/#cron-schedule-syntax
+ # # for schedule syntax, refer to https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/#cron-schedule-syntax
# schedule: "0 0 * * *" # daily report at UTC 00:00
# customized-routes:
# routes: