Install refactor and ping-iperf bugfix (#54)

This PR addresses a refactor of helm install procedure enhanced readme files improvements in the health checks alert rules bugfix: ping and iperf tests to work on non-openshift deployments cleanup of unused folders and files This change is disruptive wrt previous releases, therefore we bump the version to 2.0 Signed-off-by: Claudia <c.misale@ibm.com>
IBM · Oct 30, 2024 · 51578a6 · 51578a6
1 parent e5819c6
commit 51578a6
Show file tree

Hide file tree

Showing 34 changed files with 566 additions and 1,338 deletions.
diff --git a/.github/workflows/build-push-image.yml → .../workflows/build-push-image-from-main.yml b/.github/workflows/build-push-image.yml → .../workflows/build-push-image-from-main.yml
@@ -1,4 +1,4 @@
-name: Build and Push Container Image
+name: Build and Push Latest Container Image
 
 on:
   workflow_dispatch:
@@ -25,7 +25,6 @@ jobs:
         uses: docker/metadata-action@v5
         with:
           images: quay.io/autopilot/autopilot
-          tags: ${{ steps.meta.outputs.tags }}
 
       - name: Log into registry 
         if: github.event_name != 'pull_request'
@@ -40,4 +39,4 @@ jobs:
         with:
           context: autopilot-daemon
           push: ${{ github.event_name != 'pull_request' }}
-          tags: ${{ steps.meta.outputs.tags }}        
+          tags: "latest"  
diff --git a/.github/workflows/publish-release.yml b/.github/workflows/publish-release.yml
@@ -1,7 +1,7 @@
-name: Publish Release
+# This is a basic workflow to help you get started with Actions
+
+name: Create New Release - Quay and Helm
 on:
-  push:
-    tags: '*'
   workflow_dispatch:
 
 jobs:
@@ -19,7 +19,6 @@ jobs:
         run: |
           git config user.name "$GITHUB_ACTOR"
           git config user.email "$GITHUB_ACTOR@users.noreply.github.com"
-
       - name: Install Helm
         uses: azure/setup-helm@v3
         with:
@@ -33,5 +32,46 @@ jobs:
         with:
           pages_branch: gh-pages
           charts_dir: helm-charts
+          skip_existing: true
           packages_with_index: true
           token: ${{ secrets.GITHUB_TOKEN }}
+
+  docker:
+    runs-on: ubuntu-latest
+    steps:
+      - name: Remove unnecessary files
+        run: |
+          sudo rm -rf /usr/share/dotnet
+          sudo rm -rf /usr/local/lib/android
+          
+      - name: Checkout
+        uses: actions/checkout@v3
+        with:
+          fetch-depth: 0
+
+      - name: Read helm chart version
+        run: echo "CHART_VERSION=$(grep '^version:' helm-charts/autopilot/Chart.yaml | cut -d ":" -f2 | tr -d ' ')" >> $GITHUB_ENV
+
+      - name: Checkout
+        uses: actions/checkout@v4
+
+      - name: Docker meta
+        id: meta
+        uses: docker/metadata-action@v5
+        with:
+          images: quay.io/autopilot/autopilot
+          tags: ${{ env.CHART_VERSION }}
+
+      - name: Log into registry 
+        uses: docker/login-action@v3
+        with:
+          registry: quay.io
+          username: ${{ secrets.QUAY_USERNAME }}
+          password: ${{ secrets.QUAY_PASSWORD }}
+
+      - name: Build and push
+        uses: docker/build-push-action@v5
+        with:
+          context: autopilot-daemon
+          push: true
+          tags: ${{ steps.meta.outputs.tags }}
diff --git a/HEALTH_CHECKS.md b/HEALTH_CHECKS.md
@@ -0,0 +1,65 @@
+# Health Checks
+
+Here is a breakdown of the existing health checks:
+
+1. **PCIe Bandwidth Check (pciebw)**
+    - Description  : Host-to-device connection speeds, one measurement per GPU. Codebase in tag [v12.4.1](https://github.com/NVIDIA/cuda-samples/tree/master/Samples/1_Utilities/bandwidthTest)
+    - Outputs: Pass/fail results based on PCIe bandwidth thresholds.
+    - Implementation: Compares bandwidth results to a threshold (e.g., 8 GB/s). If the measured bandwidth falls below the threshold, it triggers a failure.
+2. **GPU Memory Check (remapped)**
+    - Description: Information from nvidia-smi regarding GPU memory remapped rows.
+    - Outputs: Reports the state of GPU memory (normal/faulty).
+    - Implementation: Analyzes remapped rows information to assess potential GPU memory issues.
+3. **GPU Memory Bandwidth Performance (gpumem)**
+    - Description: Memory bandwidth measurements using DAXPY and DGEMM.
+    - Outputs: Performance metrics (eg., TFlops, power).
+    - Implementation: CUDA code that valuates memory bandwidth and flags deviations from expected performance values.
+4. **GPU Diagnostics (dcgm)**
+    - Description: Runs NVidia DCGM diagnostics using dcgmi diag.
+    - Outputs: Diagnostic results (pass/fail).
+    - Implementation: Analyzes GPU health, including memory, power, and thermal performance.
+5. **PVC Create/Delete (pvc)**
+    - Description: Given a storage class, tests if a PVC can be created and deleted.
+    - Output: pass/fail depending on the success or failure of creation and deletion of a PVC. If either operation fail, the result is a failure.
+    - Implementation: creation of a PVC through K8s APIs.
+6. **Network Reachability Check (ping)**
+    - Description: Pings between nodes to assess connectivity.
+    - Outputs: Pass/fail based on ping success.
+    - Implementation: all-to-all reachability test.
+7. **Network Bandwidth Check (iperf)**
+    - Description: Tests network bandwidth by launching clients and servers on multiple interfaces through iperf3. Results are aggregated per interface results from network tests. Further details can be found in [the dedicated page](autopilot-daemon/network/README.md).
+    - Outputs: Aggregate bandwidth on each interface, per node (in Gb/s).
+    - Implementation: Tests network bandwidth by launching clients and servers on multiple interfaces and by running a ring topology on all network interfaces found on the pod that are exposed by network controllers like multi-nic CNI, which exposes fast network interfaces in the pods requesting them. Does not run on `eth0`.
+
+These checks are configured to run periodically (e.g., hourly), and results are accessible via Prometheus, direct API queries or labels on the worker nodes.
+
+## Deep Diagnostics and Node Labeling
+
+Autopilot runs health checks periodically on GPU nodes, and if any of the health checks returns an error, the node is labeled with `autopilot.ibm.com/gpuhealth: ERR`. Otherwise, the label is set as `PASS`.
+
+Also, more extensive tests, namely DCGM diagnostics level 3, are also executed automatically only on nodes that have free GPUs. This deeper analysis is needed to reveal problems in the GPUs that can be found only after running level 3 DCGM diagnostic.
+This type of diagnostics can help deciding if the worker node should be used for running workloads or not. To facilitate this task, Autopilot will label nodes with key `autopilot.ibm.com/dcgm.level.3`.
+
+If errors are found during the level 3 diagnostics, the label `autopilot.ibm.com/dcgm.level.3` will contain detailed information about the error in the following format: 
+
+`ERR_Year-Month-Date_Hour.Minute.UTC_Diagnostic_Test.gpuID,Diagnostic_Test.gpuID,...`
+
+- `ERR`: An indicator that an error has occurred
+- `Year-Month-Date_Hour.Minute.UTC`: Timestamp of completed diagnostics
+- `Diagnostic_Test`: Name of the test that has failed (formatted to replace spaces with underscores)
+- `gpuID`: ID of GPU where the failure has occurred
+
+**Example:** `autopilot.ibm.com/dcgm.level.3=ERR_2024-10-10_19.12.03UTC_page_retirement_row_remap.0`
+
+If there are no errors, the value is set to `PASS_Year-Month-Date_Hour.Minute.UTC`.
+
+### Logs and Metrics
+
+All health checks results are exported through Prometheus, but they can be also found in each pod's logs.
+
+All metrics are accessible through Prometheus and Grafana dashboards. The gauge exposed is `autopilot_health_checks` and can be customized with the following filters:
+
+- `check`, select one or more specific health checks
+- `node`, filter by node name
+- `cpumodel` and `gpumodel`, for heterogeneous clusters
+- `deviceid` to select specific GPUs, when available
diff --git a/README.md b/README.md
@@ -58,189 +58,12 @@ By default, the periodic checks list contains PCIe, rows remapping, GPUs power,
 
 Results from health checks are exported as Prometheus Gauges, so that users and admins can easily check the status of the system on Grafana.
 
-## Deep Diagnostics and Node Labeling
+Detailed description of all the health checks, can be found in [HEALTH_CHECKS.md](HEALTH_CHECKS.md).
 
-Autopilot runs health checks periodically and labels the nodes with `autopilot.ibm.com/gpuhealth: ERR` is any of the GPU health checks returns an error. Otherwise, health is set as `PASS`.
+## Install
 
-Also, more extensive tests, namely DCGM diagnostics level 3, are also executed automatically only on nodes that have free GPUs. This deeper analysis is needed to reveal problems in the GPUs that can be found only after running level 3 DCGM diagnostic.
-This type of diagnostics can help deciding if the worker node should be used for running workloads or not. To facilitate this task, Autopilot will label nodes with key `autopilot.ibm.com/dcgm.level.3`.
+To learn how to install Autopilot, please refer to [SETUP.md](SETUP.md)
 
-If errors are found, the label `autopilot.ibm.com/dcgm.level.3` will contain the value `ERR`, a timestamp, the test(s) that failed and the GPU id(s) if available. Otherwise, the value is set to `PASS_timestamp`.
+## Usage
 
-### Logs and Metrics
-
-All health checks results are exported through Prometheus, but they can be also found in each pod's logs.
-
-All metrics are accessible through Prometheus and Grafana dashboards. The gauge exposed is `autopilot_health_checks` and can be customized with the following filters:
-
-- `check`, select one or more specific health checks
-- `node`, filter by node name
-- `cpumodel` and `gpumodel`, for heterogeneous clusters
-- `deviceid` to select specific GPUs, when available
-
-## Install Autopilot
-
-Autopilot can be installed through Helm and need admin privileges to create objects like services, serviceaccounts, namespaces and relevant RBAC.
-
-**NOTE**: this install procedure does NOT allow the use of `--create-namespace` or `--namespace=autopilot` in the `helm` command. This is because our helm chart, creates a namespace with a label, namely, we are creating a namespace with the label `openshift.io/cluster-monitoring: "true"`, so that Prometheus can scrape metrics. This applies to OpenShift clusters **only**. It is not possible, in Helm, to create namespaces with labels or annotations through the `--create-namespace` parameter, so we decided to create the namespace ourselves.
-
-Therefore, we recommend one of the following options, which are mutually exclusive:
-
-- Option 1: use `--create-namespace` with `--namespace=autopilot` in the helm cli AND disable namespace creation in the helm chart config file `namespace.create=false`. **If on OpenShift**, then manually label the namespace for Prometheus with `label ns autopilot "openshift.io/cluster-monitoring"=`.
-- Option 2: use `namespace.create=true` in the helm chart config file BUT NOT use `--create-namespace` in the helm cli. Can still use `--namespace` in the helm cli but it should be set to something else (i.e., `default`).
-- Option 3: create the namespace by hand `kubectl create namespace autopilot`, use `--namespace autopilot` in the helm cli and set `namespace.create=false` in the helm config file. **If on OpenShift**, then manually label the namespace for Prometheus with `label ns autopilot "openshift.io/cluster-monitoring"=`.
-
-In the next release, we will remove the namespace from the Helm templates and will add OpenShift-only configurations separately.
-
-### Requirements
-
-- Need to install `helm-git` plugin
-
-```bash
-helm plugin install https://github.com/aslafy-z/helm-git --version 0.15.1
-```
-
-### Helm Chart customization
-
-Helm charts values and how-to for customization can be found [here](https://github.com/IBM/autopilot/tree/main/autopilot-daemon/helm-charts/autopilot).
-
-### Install
-
-1) Add autopilot repo
-
-```bash
-helm repo add autopilot git+https://github.com/IBM/autopilot.git@autopilot-daemon/helm-charts/autopilot?ref=gh-pages
-```
-
-2) Install autopilot (idempotent command). The config file is for customizing the helm values. Namespace is where the helm chart will live, not the namespace where Autopilot runs
-
-```bash
-helm upgrade autopilot autopilot/autopilot --install --namespace=<default> -f your-config.yml
-```
-
-The controllers should show up in the selected namespace
-
-```bash
-kubectl get po -n autopilot
-```
-
-```bash
-NAME                               READY   STATUS    RESTARTS   AGE
-autopilot-daemon-autopilot-g7j6h   1/1     Running   0          70m
-autopilot-daemon-autopilot-g822n   1/1     Running   0          70m
-autopilot-daemon-autopilot-x6h8d   1/1     Running   0          70m
-autopilot-daemon-autopilot-xhntv   1/1     Running   0          70m
-```
-
-### Uninstall
-
-```bash
- helm uninstall autopilot # -n <default>
-```
-
-## Manually Query the Autopilot Service
-
-Autopilot provides a `/status` handler that can be queried to get the entire system status, meaning that it will run all the tests on all the nodes. Autopilot is reachable by service name `autopilot-healthchecks.autopilot.svc` in-cluster only, meaning it can be reached from a pod running in the cluster, or through port forwarding (see below).
-
-Health check names are `pciebw`, `dcgm`, `remapped`, `ping`, `iperf`, `pvc`, `gpumem`.
-
-For example, using port forwarding to localhost or by exposing the service
-
-```bash
-kubectl port-forward service/autopilot-healthchecks 3333:3333 -n autopilot
-# or kubectl expose service autopilot-healthchecks -n autopilot
-```
-
-If using port forward, then launch `curl` on another terminal
-
-```bash
-curl "http://localhost:3333/status?check=pciebw&host=nodename1"
-```
-
-Alternatively, retrieve the route with `kubectl get routes autopilot-healthchecks -n autopilot`
-
-```bash
-curl "http://<route-name>/status?check=pciebw&host=nodename1"
-```
-
-All tests can be tailored by a combination of:
-
-- `host=<hostname1,hostname2,...>`, to run all tests on a specific node or on a comma separated list of nodes.
-- `check=<healthcheck1,healtcheck2,...>`, to run a single test (`pciebw`, `dcgm`, `remapped`, `gpumem`, `ping`, `iperf` or `all`) or a list of comma separated tests. When no parameters are specified, only `pciebw`, `dcgm`, `remapped`, `ping` tests are run.
-- `batch=<#hosts>`, how many hosts to check at a single moment. Requests to the batch are run in parallel asynchronously. Batching is done to avoid running too many requests in parallel when the number of worker nodes increases. Defaults to all nodes.
-
-Some health checks provide further customization.
-
-### DCGM
-
-This test runs `dcgmi diag`, and we support only `r` as [parameter](https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/dcgm-diagnostics.html#command-line-options).
-
-The default is `1`, but can customize it by `/status?check=dcgm&r=2`.
-
-### Network Bandwidth Validation with IPERF
-
-As part of this workload, Autopilot will generate the Ring Workload and then start `iperf3 servers` on each interface on each Autopilot pod based on the configuration options provided by the user.  Only after the `iperf3 servers` are started, Autopilot will begin executing the workload by starting `iperf3 clients` based on the configuration options provided by the user. All results are logged back to the user.
-
-- For each network interface on each node, an `iperf3 server` is started. The number of `iperf3 servers` is dependent on the `number of clients` intended on being run. For example, if the  `number of clients` is `8`, then there will be `8` `iperf3 servers` started per interface on a unique `port`.
-
-- For each timestep, all `pairs` are executed simultaneously. For each pair some `number of clients` are started in parallel and will run for `5 seconds` using `zero-copies` against a respective `iperf3 server`
-- Metrics such `minimum`, `maximum`, `mean`, `aggregate` bitrates and transfers are tracked. The results are stored both as `JSON` in the respective `pod` as well as summarized and dumped into the `pod logs`.
-- Invocation from the exposed Autopilot API is as follows below:
-
-```bash
-# Invoked via the `status` handle:
-curl "http://autopilot-healthchecks-autopilot.<domain>/status?check=iperf&workload=ring&pclients=<NUMBER_OF_IPERF3_CLIENTS>&startport=<STARTING_IPERF3_SERVER_PORT>"
-
-# Invoked via the `iperf` handle directly:
-curl "http://autopilot-healthchecks-autopilot.<domain>/iperf?workload=ring&pclients=<NUMBER_OF_IPERF3_CLIENTS>&startport=<STARTING_IPERF3_SERVER_PORT>"
-```
-
-### Concrete Example
-
-In this example, we target one node and check the pcie bandwidth and use the port-forwarding method.
-In this scenario, we have a value lower than `8GB/s`, which results in an alert. This error will be exported to the OpenShift web console and on Slack, if that is enabled by admins.
-
-```bash
-curl "http://127.0.0.1:3333/status?check=pciebw"
-```
-
-The output of the command above, will be similar to the following (edited to save space):
-
-```bash
-Checking status on all nodes
-Autopilot Endpoint: 10.128.6.187
-Node: hostname
-url(s): http://10.128.6.187:3333/status?host=hostname&check=pciebw
-Response:
-Checking system status of host hostname (localhost) 
-
-[[ PCIEBW ]] Briefings completed. Continue with PCIe Bandwidth evaluation.
-[[ PCIEBW ]] FAIL
-Host  hostname
-12.3 12.3 12.3 12.3 5.3 12.3 12.3 12.3
-
-Node Status: PCIE Failed
--------------------------------------
-
-
-Autopilot Endpoint: 10.131.4.93
-Node: hostname2
-url(s): http://10.131.4.93:3333/status?host=hostname2&check=pciebw
-Response:
-Checking system status of host hostname2 (localhost) 
-
-[[ PCIEBW ]] Briefings completed. Continue with PCIe Bandwidth evaluation.
-[[ PCIEBW ]] SUCCESS
-Host  hostname2
-12.1 12.0 12.3 12.3 11.9 11.5 12.1 12.1
-
-Node Status: Ok
--------------------------------------
-
-Node Summary:
-
-{'hostname': ['PCIE Failed'],
- 'hostname2': ['Ok']}
-
-runtime: 31.845192193984985 sec
-```
+To learn how to invoke health checks, please refer to [USAGE.md](USAGE.md).