fix: updates resources for OO and P-O
* fix: updates resources for OO and p-o

Problem: we pull the Prometheus Operator (p-o) deployment from
the upstream repo as a dependency. However, that manifest sets
very low resource limits on p-o, and these limits are easily hit
when the operator is managing multiple Prometheus instances.

Solution: remove the current limits, run load tests on OO and
p-o, and observe the resources they consume. Establish a baseline
for both, then multiply that baseline by 3 and add some headroom.

Issue: https://issues.redhat.com/browse/MON-2648
Closes #166

Co-authored-by: Sunil Thaha <sthaha@redhat.com>
JoaoBraveCoding and sthaha authored Jul 7, 2022
1 parent 95fe81a commit 8658ccf
Showing 4 changed files with 119 additions and 0 deletions.
3 changes: 3 additions & 0 deletions deploy/dependencies/kustomization.yaml
@@ -39,6 +39,9 @@ patches:
          requests:
            cpu: 5m
            memory: 150Mi
          limits:
            cpu: 100m
            memory: 500Mi
        terminationMessagePolicy: FallbackToLogsOnError
        securityContext:
          runAsNonRoot: true
16 changes: 16 additions & 0 deletions deploy/operator/kustomization.yaml
@@ -16,3 +16,19 @@ images:
- name: observability-operator
  newTag: 0.0.11
namespace: operators

patches:
- patch: |-
    - op: add
      path: /spec/template/spec/containers/0/resources
      value:
        requests:
          cpu: 5m
          memory: 50Mi
        limits:
          cpu: 50m
          memory: 150Mi
  target:
    group: apps
    kind: Deployment
    version: v1
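
To sanity-check how these two kustomization patches render before deploying, the overlays can be built locally; a sketch, assuming a kustomize-aware `kubectl`:

```bash
kubectl kustomize deploy/operator | grep -A 7 'resources:'
kubectl kustomize deploy/dependencies | grep -A 7 'resources:'
```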
42 changes: 42 additions & 0 deletions docs/assess-resources.md
@@ -0,0 +1,42 @@

# Procedure to assess resources used by Observability Operator

1. Provision an OpenShift cluster

2. Run `oc apply -f hack/olm/catalog-src.yaml` to install the Observability Operator (OO) catalogue.

3. Using the UI, install OO
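
If you prefer the CLI, OO can also be installed by creating an OLM `Subscription`; a minimal sketch only — the channel, package, and catalog-source names below are assumptions, so check `hack/olm/catalog-src.yaml` and the catalog's package manifest for the real values:

```bash
oc apply -f - <<EOF
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: observability-operator
  namespace: openshift-operators
spec:
  channel: development              # assumed channel name
  name: observability-operator      # assumed package name
  source: observability-operator    # assumed CatalogSource name
  sourceNamespace: openshift-marketplace
EOF
```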

4. Scale down the following deployments so that we can remove the limits currently set on OO:

```bash
# Scale down the cluster version operator
oc -n openshift-cluster-version scale deployment.apps/cluster-version-operator --replicas=0
# Scale down the OLM operator
oc -n openshift-operator-lifecycle-manager scale deployment.apps/olm-operator --replicas=0
```
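
Before proceeding, it can be worth confirming that both deployments now report 0 replicas (a quick sanity check, not part of the original procedure):

```bash
# Both commands should print 0
oc -n openshift-cluster-version get deployment cluster-version-operator -o jsonpath='{.spec.replicas}{"\n"}'
oc -n openshift-operator-lifecycle-manager get deployment olm-operator -o jsonpath='{.spec.replicas}{"\n"}'
```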

5. Edit the OO and Prometheus Operator deployments to remove their limits with:

```bash
oc -n openshift-operators patch deployment.apps/observability-operator --type='json' -p='[{"op": "remove", "path": "/spec/template/spec/containers/0/resources/limits"}]'
oc -n openshift-operators patch deployment.apps/observability-operator-prometheus-operator --type='json' -p='[{"op": "remove", "path": "/spec/template/spec/containers/0/resources/limits"}]'
```
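
To verify the patches took effect, inspect each container's `resources` block; after the patch it should list only `requests`:

```bash
oc -n openshift-operators get deployment observability-operator \
  -o jsonpath='{.spec.template.spec.containers[0].resources}{"\n"}'
oc -n openshift-operators get deployment observability-operator-prometheus-operator \
  -o jsonpath='{.spec.template.spec.containers[0].resources}{"\n"}'
```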

6. Run the load tests with `./hack/loadtest/test.sh`
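
While the script runs, the stacks can be watched as OO reconciles them; a sketch, assuming the plural resource name `monitoringstacks` registered by the `monitoring.rhobs/v1alpha1` CRD:

```bash
kubectl get monitoringstacks.monitoring.rhobs -A -w
```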

7. Using the OpenShift UI in the Developer tab, navigate to Observe and input the following queries.
    1. For memory we should look at `container_memory_rss`, as that is the metric used by the kubelet to OOM-kill the container
    2. For CPU we should look at `container_cpu_usage_seconds_total`, as that is the metric used by the kubelet

```bash
# PromQL for memory
container_memory_rss{container!~"|POD", namespace="openshift-operators"}
# PromQL for CPU
sum(node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate{namespace='openshift-operators'}) by (pod)
```
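
The same queries can also be run from a terminal against the cluster's Thanos querier route; this follows the standard OpenShift pattern, shown here for the memory query only:

```bash
TOKEN=$(oc whoami -t)
HOST=$(oc -n openshift-monitoring get route thanos-querier -o jsonpath='{.spec.host}')
curl -sk -H "Authorization: Bearer ${TOKEN}" \
  "https://${HOST}/api/v1/query" \
  --data-urlencode 'query=container_memory_rss{container!~"|POD", namespace="openshift-operators"}'
```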

8. Take measurements of the performance of both OO and Prometheus Operator
    1. Establish a baseline for both CPU and memory (the minimum they consume); these will be our `requests`
    2. Multiply that value by 3 and validate that it fits within the range of values observed; these will be our `limits`
    3. Give `limits` some extra headroom to anticipate future growth; see the worked example below
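
As a worked example, reading the numbers back out of this commit's kustomization patches: OO's memory baseline came out at roughly 50Mi, giving `requests.memory: 50Mi` and `limits.memory: 3 × 50Mi = 150Mi`; p-o's baseline was roughly 150Mi, giving `requests.memory: 150Mi` and `limits.memory: 3 × 150Mi = 450Mi`, rounded up to `500Mi` for headroom. The CPU limits (50m for OO, 100m for p-o) were given proportionally more room over the 5m requests, presumably because exceeding a CPU limit only throttles the container, while exceeding a memory limit gets it OOM-killed.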
58 changes: 58 additions & 0 deletions hack/loadtest/test.sh
@@ -0,0 +1,58 @@
#!/usr/bin/env bash

set -e -u -o pipefail
trap cleanup INT

# Function that, given a number, creates a namespace
# and, in that namespace, a monitoring stack
create_monitoring_stack() {
  local stack_number=$1; shift
  local ms_name=stack-$stack_number
  local namespace=loadtest-$stack_number

  monitoring_stack=$(cat <<- EOF
apiVersion: monitoring.rhobs/v1alpha1
kind: MonitoringStack
metadata:
  name: ${ms_name}
  namespace: ${namespace}
  labels:
    load-test: test
spec:
  logLevel: debug
  retention: 15d
  resourceSelector:
    matchLabels:
      load-test-instance: ${ms_name}
EOF
)

  kubectl create namespace "$namespace"
  echo "$monitoring_stack" | kubectl -n "$namespace" apply -f -
}

cleanup() {
  echo "INFO: cleaning up all namespaces"
  kubectl delete ns loadtest-{1..10}
}

main() {
  # Goal: create 10 MonitoringStack CRs, wait for OO to
  # reconcile them, and then clean up

  echo "INFO: Running load test"
  for ((i=1; i<=10; i++)); do
    create_monitoring_stack "$i"
  done

  # Give OO some time to reconcile all the MonitoringStacks
  # and create the necessary resources
  local timeout=180
  echo "INFO: sleeping for ${timeout}s"
  sleep "$timeout"

  cleanup
}

main "$@"
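
The script is meant to be run from the repository root against a cluster where OO is already installed (steps 1-3 of the procedure above). Because of the `trap cleanup INT` at the top, interrupting a run with Ctrl-C still deletes the `loadtest-*` namespaces:

```bash
./hack/loadtest/test.sh
```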
