fix: updates resources for OO and P-O
* fix: updates resources for OO and p-o

Problem: we pull the Prometheus Operator (p-o) deployment from
the upstream repo as a dependency. However, that manifest sets
very low resource limits on p-o, and these limits are easily hit
when the operator is managing multiple Prometheus instances.

Solution: remove the current limits, run load tests on OO and
p-o, and observe the resources they consume. Establish a baseline
for both, then multiply that baseline by 3 and add some headroom.

Issue: https://issues.redhat.com/browse/MON-2648
Closes #166

Co-authored-by: Sunil Thaha <sthaha@redhat.com>
JoaoBraveCoding and sthaha authored Jul 7, 2022
1 parent 95fe81a commit 8658ccf
Showing 4 changed files with 119 additions and 0 deletions.
3 changes: 3 additions & 0 deletions deploy/dependencies/kustomization.yaml
@@ -39,6 +39,9 @@ patches:
          requests:
            cpu: 5m
            memory: 150Mi
          limits:
            cpu: 100m
            memory: 500Mi
        terminationMessagePolicy: FallbackToLogsOnError
        securityContext:
          runAsNonRoot: true
16 changes: 16 additions & 0 deletions deploy/operator/kustomization.yaml
@@ -16,3 +16,19 @@ images:
- name: observability-operator
  newTag: 0.0.11
namespace: operators

patches:
- patch: |-
    - op: add
      path: /spec/template/spec/containers/0/resources
      value:
        requests:
          cpu: 5m
          memory: 50Mi
        limits:
          cpu: 50m
          memory: 150Mi
  target:
    group: apps
    kind: Deployment
    version: v1
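
To sanity-check how these two kustomization patches render before deploying, the overlays can be built locally; a sketch, assuming a kustomize-aware `kubectl`:

```bash
kubectl kustomize deploy/operator | grep -A 7 'resources:'
kubectl kustomize deploy/dependencies | grep -A 7 'resources:'
```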
42 changes: 42 additions & 0 deletions docs/assess-resources.md
@@ -0,0 +1,42 @@

# Procedure to assess resources used by Observability Operator

1. Provision an OpenShift cluster

2. Run `oc apply -f hack/olm/catalog-src.yaml` to install the Observability Operator (OO) catalogue.

3. Using the UI, install OO
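
If you prefer the CLI, OO can also be installed by creating an OLM `Subscription`; a minimal sketch only — the channel, package, and catalog-source names below are assumptions, so check `hack/olm/catalog-src.yaml` and the catalog's package manifest for the real values:

```bash
oc apply -f - <<EOF
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: observability-operator
  namespace: openshift-operators
spec:
  channel: development              # assumed channel name
  name: observability-operator      # assumed package name
  source: observability-operator    # assumed CatalogSource name
  sourceNamespace: openshift-marketplace
EOF
```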

4. Scale down the following deployments so that we can remove the limits currently set on OO:

```bash
# Scale down the cluster version operator
oc -n openshift-cluster-version scale deployment.apps/cluster-version-operator --replicas=0
# Scale down the OLM operator
oc -n openshift-operator-lifecycle-manager scale deployment.apps/olm-operator --replicas=0
```
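
Before proceeding, it can be worth confirming that both deployments now report 0 replicas (a quick sanity check, not part of the original procedure):

```bash
# Both commands should print 0
oc -n openshift-cluster-version get deployment cluster-version-operator -o jsonpath='{.spec.replicas}{"\n"}'
oc -n openshift-operator-lifecycle-manager get deployment olm-operator -o jsonpath='{.spec.replicas}{"\n"}'
```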

5. Edit the OO and Prometheus Operator deployments to remove their limits with:

```bash
oc -n openshift-operators patch deployment.apps/observability-operator --type='json' -p='[{"op": "remove", "path": "/spec/template/spec/containers/0/resources/limits"}]'
oc -n openshift-operators patch deployment.apps/observability-operator-prometheus-operator --type='json' -p='[{"op": "remove", "path": "/spec/template/spec/containers/0/resources/limits"}]'
```
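
To verify the patches took effect, inspect each container's `resources` block; after the patch it should list only `requests`:

```bash
oc -n openshift-operators get deployment observability-operator \
  -o jsonpath='{.spec.template.spec.containers[0].resources}{"\n"}'
oc -n openshift-operators get deployment observability-operator-prometheus-operator \
  -o jsonpath='{.spec.template.spec.containers[0].resources}{"\n"}'
```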

6. Run the load tests with `./hack/loadtest/test.sh`
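
While the script runs, the stacks can be watched as OO reconciles them; a sketch, assuming the plural resource name `monitoringstacks` registered by the `monitoring.rhobs/v1alpha1` CRD:

```bash
kubectl get monitoringstacks.monitoring.rhobs -A -w
```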

7. Using the OpenShift UI in the Developer tab, navigate to Observe and input the following queries.
    1. For memory we should look at `container_memory_rss`, as that is the metric used by the kubelet to OOM-kill the container
    2. For CPU we should look at `container_cpu_usage_seconds_total`, as that is the metric used by the kubelet

```bash
# PromQL for memory
container_memory_rss{container!~"|POD", namespace="openshift-operators"}
# PromQL for CPU
sum(node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate{namespace='openshift-operators'}) by (pod)
```
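
The same queries can also be run from a terminal against the cluster's Thanos querier route; this follows the standard OpenShift pattern, shown here for the memory query only:

```bash
TOKEN=$(oc whoami -t)
HOST=$(oc -n openshift-monitoring get route thanos-querier -o jsonpath='{.spec.host}')
curl -sk -H "Authorization: Bearer ${TOKEN}" \
  "https://${HOST}/api/v1/query" \
  --data-urlencode 'query=container_memory_rss{container!~"|POD", namespace="openshift-operators"}'
```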

8. Take measurements of the performance of both OO and Prometheus Operator
    1. Establish a baseline for both CPU and memory (the minimum they consume); these will be our `requests`
    2. Multiply that value by 3 and validate that it fits within the range of values observed; these will be our `limits`
    3. Give `limits` some extra headroom to anticipate future growth; see the worked example below
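
As a worked example, reading the numbers back out of this commit's kustomization patches: OO's memory baseline came out at roughly 50Mi, giving `requests.memory: 50Mi` and `limits.memory: 3 × 50Mi = 150Mi`; p-o's baseline was roughly 150Mi, giving `requests.memory: 150Mi` and `limits.memory: 3 × 150Mi = 450Mi`, rounded up to `500Mi` for headroom. The CPU limits (50m for OO, 100m for p-o) were given proportionally more room over the 5m requests, presumably because exceeding a CPU limit only throttles the container, while exceeding a memory limit gets it OOM-killed.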
58 changes: 58 additions & 0 deletions hack/loadtest/test.sh
@@ -0,0 +1,58 @@
#!/usr/bin/env bash

set -e -u -o pipefail
trap cleanup INT

# Function that, given a number, creates a namespace
# and, in that namespace, a monitoring stack
create_monitoring_stack() {
  local stack_number=$1; shift
  local ms_name=stack-$stack_number
  local namespace=loadtest-$stack_number

  monitoring_stack=$(cat <<- EOF
apiVersion: monitoring.rhobs/v1alpha1
kind: MonitoringStack
metadata:
  name: ${ms_name}
  namespace: ${namespace}
  labels:
    load-test: test
spec:
  logLevel: debug
  retention: 15d
  resourceSelector:
    matchLabels:
      load-test-instance: ${ms_name}
EOF
)

  kubectl create namespace "$namespace"
  echo "$monitoring_stack" | kubectl -n "$namespace" apply -f -
}

cleanup() {
  echo "INFO: cleaning up all namespaces"
  kubectl delete ns loadtest-{1..10}
}

main() {
  # Goal: create 10 MonitoringStack CRs, wait for OO to
  # reconcile them, and then clean up

  echo "INFO: Running load test"
  for ((i=1; i<=10; i++)); do
    create_monitoring_stack "$i"
  done

  # Give OO some time to reconcile all the MonitoringStacks
  # and create the necessary resources
  local timeout=180
  echo "INFO: sleeping for ${timeout}s"
  sleep "$timeout"

  cleanup
}

main "$@"
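
The script is meant to be run from the repository root against a cluster where OO is already installed (steps 1-3 of the procedure above). Because of the `trap cleanup INT` at the top, interrupting a run with Ctrl-C still deletes the `loadtest-*` namespaces:

```bash
./hack/loadtest/test.sh
```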
