Refactor smoketest script #468

leifmadsen · 2023-09-15T20:38:32Z

Perform a bit of smoketest refactoring and fix up a few bugs.

* Update alert trigger to use startsAt in order to potentially speed up delivery of the alerts. Failures in the SNMP_WEBHOOK_STATUS seem to be primarily due to delayed alert notification through prometheus-snmp-webhook.
* Add an alert clean up task as part of the clean up logic at the end.
* Update openssl x509 to not use the -in flag, which seems unnecessary and on some systems causes a failure.
* Add new SMOKETEST_VERBOSE boolean so local testing can skip massive amounts of information dumped to stdout.
* Remove curl pod using label selector for slightly cleaner output.
* Update failure check to combine RET and SNMP_WEBHOOK_STATUS since testing seems to show the change is slightly more reliable.
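
For illustration, the startsAt change could look roughly like the sketch below. It only builds the JSON body for Alertmanager's POST /api/v2/alerts endpoint; the function name, the label values, and the ALERTMANAGER_URL variable in the usage comment are assumptions, not what smoketest.sh actually contains.

```shell
# Build a one-element alert list suitable for Alertmanager's /api/v2/alerts.
# $1: alert name (hypothetical label value, chosen for illustration)
# $2: optional RFC 3339 start time; defaults to the current UTC time
build_alert_payload() {
    local name="$1"
    local starts_at="${2:-$(date -u +%Y-%m-%dT%H:%M:%SZ)}"
    printf '[{"labels":{"alertname":"%s","severity":"warning"},"startsAt":"%s"}]\n' \
        "$name" "$starts_at"
}

# Possible usage (ALERTMANAGER_URL is a placeholder, not a variable from the script):
# curl -H 'Content-Type: application/json' \
#      -d "$(build_alert_payload smoketest)" "$ALERTMANAGER_URL/api/v2/alerts"
```

Taking the timestamp as an optional argument keeps the payload reproducible when testing locally.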

leifmadsen · 2023-09-15T20:48:56Z

tests/smoketest/smoketest.sh

-oc logs "$(oc get pod -l prometheus=default -o jsonpath='{.items[0].metadata.name}')" -c prometheus
-echo
+echo "*** [INFO] Checking that the qdr certificate has a long expiry"
+EXPIRETIME=$(oc get secret default-interconnect-openstack-ca -o json | grep \"tls.crt\"\: | awk -F '": "' '{print $2}' | rev | cut -c3- | rev | base64 -d | openssl x509 -text | grep "Not After" | awk -F " : " '{print $2}')


My machine fails with the following error when using the -in - flag with openssl. Found a post from 4 years ago that says it is unnecessary? Guess we'll find out if maybe it's a difference between older containers and my more modern desktop.

Could not open file or uri for loading certificate from -
002EB2F38A7F0000:error:16000069:STORE routines:ossl_store_get0_loader_int:unregistered scheme:crypto/store/store_register.c:237:scheme=file
002EB2F38A7F0000:error:80000002:system library:file_open:No such file or directory:providers/implementations/storemgmt/file_store.c:267:calling stat(-)
Unable to load certificate
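
A minimal sketch of reading the certificate from stdin without -in. The helper name is hypothetical, and it uses -noout -enddate rather than the -text plus grep pipeline in the diff, so treat it as an alternative rather than what the PR does.

```shell
# Print the notAfter date of a PEM certificate read from stdin.
# With no -in option, openssl x509 reads the certificate from standard input.
cert_not_after() {
    openssl x509 -noout -enddate | sed 's/^notAfter=//'
}

# Possible usage against the secret from the diff (jsonpath form is an assumption,
# the script itself uses grep/awk on the JSON output):
# oc get secret default-interconnect-openstack-ca -o jsonpath='{.data.tls\.crt}' \
#     | base64 -d | cert_not_after
```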

leifmadsen · 2023-09-15T20:49:11Z

tests/smoketest/smoketest.sh

+EXPIRETIME_UNIX=$(date -d "${EXPIRETIME}" "+%s")
+TARGET_UNIX=$(date -d "now + 7 years" "+%s")
+if [ ${EXPIRETIME_UNIX} -lt ${TARGET_UNIX} ]; then
+ echo "[FAILURE] Certificate expire time (${EXPIRETIME}) less than 7 years from now"


Should this actually cause a failure in our testing?

I've seen this fail if there's sufficient time between deploying STF and running the smoketests.

I don't think that kind of test is really useful, especially if you rotate your certificates more often.
Maybe a test to make sure the certificate is valid and the value is not the default? i.e. confirm that we set the right vars to actually update the expiry time.

Well, the check here is to make sure that the certificate is created with a long lifetime by default. This isn't a production system, so the certificates aren't rotating or anything. This check was added as part of the work where we changed our default certificate lifetime to cover the RHOSP lifecycle, so that users weren't having their connections fail due to certificate expiry. The check validates that STF is requesting certificates with that long lifetime and doesn't regress.

Yeah, this is a regression test; but we don't have a regression test suite anywhere, so it ended up in the smoke test
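
One way to keep this as a regression check while tolerating a gap between deploying STF and running the smoketest is to allow a small grace period. The sketch below assumes GNU date (as the diff already does); the function name and the grace_days argument are hypothetical and not part of the PR.

```shell
# Succeed when the expiry date is at least "now + offset", minus a grace period.
# $1: expiry date in any format GNU date -d accepts
# $2: required offset from now, e.g. "7 years"
# $3: optional grace period in whole days (hypothetical knob, defaults to 0)
cert_lifetime_ok() {
    local expire="$1" offset="$2" grace_days="${3:-0}"
    local expire_unix target_unix
    expire_unix=$(date -d "$expire" +%s) || return 2
    target_unix=$(date -d "now + $offset" +%s) || return 2
    [ "$expire_unix" -ge $((target_unix - grace_days * 86400)) ]
}

# Possible usage with the variable from the diff:
# cert_lifetime_ok "${EXPIRETIME}" "7 years" 1 || echo "[FAILURE] certificate lifetime too short"
```

With a grace of one day, a certificate issued for the long default lifetime still passes when the smoketest runs a few hours after deployment.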

leifmadsen · 2023-09-15T20:49:59Z

tests/smoketest/smoketest.sh

-echo "*** [INFO] Logs from snmp webhook..."
-oc logs "$(oc get pod -l app=default-snmp-webhook -o jsonpath='{.items[0].metadata.name}')"
+echo "*** [INFO] Showing servicemonitors..."
+oc get servicemonitors.monitoring.rhobs -o yaml


Need to also remember to go through docs and make sure we call things out in expanded form because the defaults will look everything up via *.monitoring.coreos.com.

leifmadsen · 2023-09-18T14:47:23Z

Actually I thought both OCP 4.12 and 4.10 failed, but it seems that the OCP 4.10 CI system failed again. I've done local testing and gotten multiple passes to happen, and it looks like OCP 4.12 has also passed. However I see this at the very end:

*** [FAILURE] Smoke test job still not succeeded after 300s

Then I can see in the logs output for the snmp-webhook receiver that it never got an alert:

*** [INFO] Logs from snmp webhook...
DEBUG:prometheus_webhook_snmp.utils:Configuration settings: {"debug": true, "snmp_host": "192.168.24.254", "snmp_port": 162, "snmp_community": "public", "snmp_retries": 5, "snmp_timeout": 1, "alert_oid_label": "oid", "trap_oid_prefix": "1.3.6.1.4.1.50495.15", "trap_default_oid": "1.3.6.1.4.1.50495.15.1.2.1", "trap_default_severity": "", "host": "0.0.0.0", "port": 9099, "metrics": false, "cert": "", "key": ""}
INFO:cherrypy.error:[15/Sep/2023:20:55:15] ENGINE Listening for SIGTERM.
INFO:cherrypy.error:[15/Sep/2023:20:55:15] ENGINE Listening for SIGHUP.
INFO:cherrypy.error:[15/Sep/2023:20:55:15] ENGINE Listening for SIGUSR1.
INFO:cherrypy.error:[15/Sep/2023:20:55:15] ENGINE Bus STARTING
INFO:cherrypy.error:[15/Sep/2023:20:55:15] ENGINE Serving on http://0.0.0.0:9099
INFO:cherrypy.error:[15/Sep/2023:20:55:15] ENGINE Bus STARTED

And I see near the top that the curl pod actually errored out (not sure why...):

*** [INFO] Creating smoketest jobs...
No resources found
job.batch/stf-smoketest-smoke1 created
*** [INFO] Triggering an alertmanager notification...
*** [INFO] Create alert
No resources found
pod/curl created
*** [INFO] Waiting to see SNMP trap message in webhook pod
*** [INFO] Waiting on job/stf-smoketest-smoke1...
job.batch/stf-smoketest-smoke1 condition met
*** [INFO] Checking that the qdr certificate has a long expiry
*** [INFO] Showing oc get all...
NAME                                                            READY   STATUS      RESTARTS   AGE
pod/alertmanager-default-0                                      3/3     Running     0          3m17s
pod/cert-manager-operator-controller-manager-7fccfddcbb-bnrt4   2/2     Running     0          14m
pod/curl                                                        0/1     Error       0          2m31s

Not sure why the smoketest stage actually passed then. Seems like I must have broken something there.
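
To find out why the curl pod errored, one option is to wait for the pod to reach a terminal phase before asking for its logs. This is a rough sketch only: the function name and the POLL_INTERVAL variable are made up for illustration, and it polls the pod phase rather than using any particular wait mechanism from the script.

```shell
# Poll a pod until it reaches a terminal phase.
# $1: pod name
# $2: optional number of polls before giving up (defaults to 30)
# Returns 0 on Succeeded, 1 on Failed, 2 if the pod never finished in time.
wait_for_pod_done() {
    local pod="$1" tries="${2:-30}" phase
    while [ "$tries" -gt 0 ]; do
        phase=$(oc get pod "$pod" -o jsonpath='{.status.phase}')
        case "$phase" in
            Succeeded) return 0 ;;
            Failed) return 1 ;;
        esac
        tries=$((tries - 1))
        # POLL_INTERVAL is a hypothetical knob, in seconds
        sleep "${POLL_INTERVAL:-2}"
    done
    return 2
}

# Possible usage: show the logs whenever the pod did not succeed.
# wait_for_pod_done curl || oc logs curl
```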

leifmadsen · 2023-09-18T14:54:04Z

tests/smoketest/smoketest.sh

-fi
-
-if [ $RET -eq 0 ]; then
+if [ $RET -eq 0 ] && [ $SNMP_WEBHOOK_STATUS -eq 0 ]; then
 echo "*** [SUCCESS] Smoke test job completed successfully"
 else
 echo "*** [FAILURE] Smoke test job still not succeeded after ${JOB_TIMEOUT}"


Oh I see the issue here now. I should actually have added the exit 1 here because the script will never fail if SNMP_WEBHOOK_STATUS fails now due to my combination of the checks here. Since the $RET value was 0 it passed the job.

Actually I kind of discussed in chat about maybe having the snmp-webhook test be a bit of an optional pass, in which case, maybe this is working as designed now? :)

I would vote for keeping SNMP as experimental (meaning, don't fail the CI if the SNMP tests don't pass) if it doesn't behave in a reliable way.

I've run into some issues where the SNMP check is actually catching a real issue :) I'm going to end up putting it back in as a required test to pass.
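
With the SNMP check back as a required test, the final status handling could be sketched as below, so that either failure produces a non-zero exit. The function name and the SNMP failure message are assumptions; the other two messages are taken from the diff.

```shell
# Report the overall result and return non-zero if either check failed.
# $1: smoke test job status (RET), $2: SNMP webhook status, $3: job timeout label
report_result() {
    local ret="$1" snmp_status="$2" timeout="$3"
    if [ "$ret" -eq 0 ] && [ "$snmp_status" -eq 0 ]; then
        echo "*** [SUCCESS] Smoke test job completed successfully"
        return 0
    fi
    if [ "$ret" -ne 0 ]; then
        echo "*** [FAILURE] Smoke test job still not succeeded after ${timeout}"
    fi
    if [ "$snmp_status" -ne 0 ]; then
        # Hypothetical message, not taken from smoketest.sh
        echo "*** [FAILURE] SNMP webhook did not receive the alert"
    fi
    return 1
}

# Possible usage at the end of the script:
# report_result "$RET" "$SNMP_WEBHOOK_STATUS" "$JOB_TIMEOUT" || exit 1
```

Printing a separate message per failed check avoids blaming the job timeout when only the webhook check failed.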

csibbitt · 2023-09-18T16:16:05Z

tests/smoketest/smoketest.sh

 fi
 echo

-if [ $SNMP_WEBHOOK_STATUS -ne 0 ]; then


👍 Pretty sure this will fix a long-standing annoyance where pretty much any early failure was printing "SNMP Webhook failed".

leifmadsen · 2023-09-18T18:40:20Z

OK I'm merging this down. It's landing in issue/306 so we still have an opportunity to adjust if we want, but I'm looking to get everything merged down. Passes 1 of 2 CI systems, so good enough for merging to not-master :)

* [issue#306] Add missing ClusterRoles

  The cluster-monitoring-operator is required for STF to install. It creates the required alertmanager-main and prometheus-k8s ClusterRoles, and STF relies on these being present. These are not present when using CRC, so ClusterRoles need to be explicitly created. The names of the ClusterRoles have been updated, in case there is some conflict when cluster-monitoring-operator is installed after STF.

  This is a workaround for not having cluster-monitoring-operator installed: #306

  resolves #306

* Fix up the RBAC setup for prometheus-stf (#467)

  Fix up the RBAC changes to fully get prometheus-stf working and decoupled from prometheus-k8s. Switches to using a separate prometheus-stf ClusterRole, ClusterRoleBinding, and ServiceAccount, along with a Role and RoleBinding, all using prometheus-stf as the ServiceAccount. Also updates the Alertmanager configuration to use alertmanager-stf instead of alertmanager-main.

* Fix smoketest to use prometheus-stf for token retrieval

* Refactor smoketest script (#468)

  * Refactor smoketest script

    Perform a bit of smoketest refactoring and fix up a few bugs.

    * Update alert trigger to use startsAt in order to potentially speed up delivery of the alerts. Failures in the SNMP_WEBHOOK_STATUS seem to be primarily due to delayed alert notification through prometheus-snmp-webhook.
    * Add an alert clean up task as part of the clean up logic at the end.
    * Update openssl x509 to not use the -in flag, which seems unnecessary and on some systems causes a failure.
    * Add new SMOKETEST_VERBOSE boolean so local testing can skip massive amounts of information dumped to stdout.
    * Remove curl pod using label selector for slightly cleaner output.
    * Update failure check to combine RET and SNMP_WEBHOOK_STATUS since testing seems to show the change is slightly more reliable.

  * Show logs from curl

* Remove nodes/metrics permission from ClusterRole

  As part of least privilege work, remove the nodes/metrics permission as we're not scraping nodes for information. Everything appears to continue working in STF without this permission.

* Move SCC RBAC from ClusterRole to Role

  Working on simplifying and reducing our access scope as much as possible. It appears moving SCC RBAC from ClusterRole to Role allows things to continue to work with Prometheus. It's possible further testing may reveal this will need to be reverted.

* Convert alertmanager-stf Role to ClusterRole (#473)

  Convert alertmanager-stf Role to ClusterRole as the tokenreviews and subjectaccessreviews resources need to be accessible at the cluster scope.

* Create ClusterRoleBinding and Role for alertmanager (#475)

  * Create ClusterRoleBinding and Role for alertmanager

    Create appropriate ClusterRoleBinding and Role for alertmanager-stf, breaking out SCC into a Role vs ClusterRole to keep things in alignment with the prometheus-stf RBAC setup.

  * Adjust smoketest.sh for SNMP webhook test failures

    Adjust the smoketest script to also fail when the SNMP webhook test has failed. Add a wait condition for the curl pod to complete so logs can be retrieved.

  * Add *RoleBinding rescue capabilities

    If changes happen to the ClusterRoleBinding or RoleBinding then generally the system is not going to allow you to patch the object. Adds block/rescue logic to remove the existing ClusterRoleBinding or RoleBinding before creating it when patching the object fails.

---------

Co-authored-by: Leif Madsen <lmadsen@redhat.com>

leifmadsen requested review from csibbitt, elfiesmelfie and vkmc September 15, 2023 20:38

leifmadsen mentioned this pull request Sep 15, 2023

fixup/smoketest #469

Closed

leifmadsen force-pushed the fixup/smoketest branch from b0700e5 to b333815 Compare September 18, 2023 14:49

Show logs from curl

155c5af

csibbitt approved these changes Sep 18, 2023

leifmadsen merged commit 6bb6393 into issue/306 Sep 18, 2023
7 of 8 checks passed

leifmadsen deleted the fixup/smoketest branch September 18, 2023 18:40
