Implement SNMPtrap delivery controls #404

Merged
leifmadsen merged 5 commits into master from stf-559/configurable-snmp-webhook-params
Feb 28, 2023

@leifmadsen leifmadsen commented Feb 13, 2023

Implement ability to override the default values for the SNMPtrap
alertmanager receiver via prometheus-webhook-snmp component.

Closes: STF-559
Depends-on: infrawatch/prometheus-webhook-snmp#14

@leifmadsen added the do-not-merge and needs-testing labels Feb 13, 2023
@leifmadsen self-assigned this Feb 13, 2023
@leifmadsen
Member Author

test

Run the following command to update the bundle artifacts:

operator-sdk-0.19.4 generate bundle   --metadata   --manifests   --channels unstable   --default-channel unstable
@leifmadsen
Member Author

Marked this as "do not merge" because I've made the changes locally but have not yet imported them into a working environment for building and testing.

@elfiesmelfie
Collaborator

test

Build out the remaining options for prometheus-webhook-snmp to allow for
finer grained controls and delivery of SNMP traps via alertmanager
alerts.
@leifmadsen
Member Author

test

@leifmadsen
Member Author

Performed another deployment to test a few of the new changes in this PR, with the following results.

Deployment with this command:

ansible-playbook \
  -e namespace="service-telemetry" \
  -e __local_build_enabled=true \
  -e __service_telemetry_snmptraps_enabled=true \
  -e __service_telemetry_storage_ephemeral_enabled=false \
  -e prometheus_webhook_snmp_branch="stf-559/configurable-snmp-webhook-params" \
  -e __service_telemetry_snmptraps_community=testing \
  -e __service_telemetry_snmptraps_target=192.168.255.255 \
  -e __service_telemetry_snmptraps_retries=7 \
  -e __service_telemetry_snmptraps_port=6200 \
  -e __service_telemetry_snmptraps_timeout=5 \
  -e __service_telemetry_trap_default_severity=warning \
  ./build/run-ci.yaml

Results in this spec (partial):

        "spec": {
            "alerting": {
                "alertmanager": {
                    "receivers": {
                        "snmpTraps": {
                            "alertOidLabel": "oid",
                            "community": "testing",
                            "enabled": true,
                            "port": 6200,
                            "retries": 7,
                            "target": "192.168.255.255",
                            "timeout": 5,
                            "trapDefaultOid": "1.3.6.1.4.1.50495.15.1.2.1",
                            "trapDefaultSeverity": "",
                            "trapOidPrefix": "1.3.6.1.4.1.50495.15"
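
For readers who want to see where these overrides sit in the object, the nesting from the partial spec above can be sketched in Python. This is only an illustration of the path spec.alerting.alertmanager.receivers.snmpTraps; `build_snmp_patch` is a hypothetical helper, not part of the operator or the CI tooling, and how the resulting document would be applied to a cluster is deliberately left out.

```python
import json

# Values copied from the spec shown above; nothing here is new configuration.
snmp_traps = {
    "enabled": True,
    "target": "192.168.255.255",
    "port": 6200,
    "community": "testing",
    "retries": 7,
    "timeout": 5,
}


def build_snmp_patch(overrides):
    """Nest SNMP trap overrides under spec.alerting.alertmanager.receivers.snmpTraps.

    Hypothetical helper for illustration only.
    """
    return {
        "spec": {
            "alerting": {
                "alertmanager": {
                    "receivers": {"snmpTraps": dict(overrides)}
                }
            }
        }
    }


patch = build_snmp_patch(snmp_traps)
print(json.dumps(patch, indent=2))
```

Any field left out of the overrides is simply absent from the sketch; whether the operator then falls back to its defaults is an assumption based on the PR description.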

And everything completed:

        "* [info] CI Build complete. You can now run tests."
    ]
}

PLAY RECAP ***************************************************************************************************************************************************************************************************************************************
localhost                  : ok=114  changed=46   unreachable=0    failed=0    skipped=29   rescued=0    ignored=0   

Everything is running:

oc get pods
NAME                                                      READY   STATUS      RESTARTS   AGE
alertmanager-default-0                                    3/3     Running     0          2m44s
default-cloud1-ceil-event-smartgateway-556d896669-vp5bp   2/2     Running     0          119s
default-cloud1-ceil-meter-smartgateway-5d5dd9965c-qwxqd   3/3     Running     0          2m12s
default-cloud1-coll-event-smartgateway-64b4786966-cw4rw   2/2     Running     0          2m1s
default-cloud1-coll-meter-smartgateway-9745cd6c6-qgk2s    3/3     Running     0          2m12s
default-cloud1-sens-meter-smartgateway-84f55b998c-247vr   3/3     Running     0          2m9s
default-interconnect-845c4b647c-7x84w                     1/1     Running     0          3m10s
default-snmp-webhook-7c9f5b769b-hzmwq                     1/1     Running     0          2m52s
elastic-operator-88bf4556f-rjdj7                          1/1     Running     0          11m
elasticsearch-es-default-0                                1/1     Running     0          2m34s
interconnect-operator-99dc7f8d8-4fgt9                     1/1     Running     0          11m
prometheus-default-0                                      3/3     Running     0          98s
prometheus-operator-6f75dffbf4-vxz7l                      1/1     Running     0          11m
prometheus-webhook-snmp-2-build                           0/1     Completed   0          5m39s
service-telemetry-operator-2-build                        0/1     Completed   0          11m
service-telemetry-operator-5c4759bb6c-twnmg               1/1     Running     0          3m41s
sg-bridge-2-build                                         0/1     Completed   0          6m45s
sg-core-2-build                                           0/1     Completed   0          10m
smart-gateway-operator-2-build                            0/1     Completed   0          10m
smart-gateway-operator-58d6449bb9-56cxz                   1/1     Running     0          3m48s

Verified that the settings are passed through to prometheus-webhook-snmp built from the same-named branch:

oc logs -f default-snmp-webhook-7c9f5b769b-hzmwq
DEBUG:prometheus_webhook_snmp.utils:Configuration settings: {"debug": true, "snmp_host": "192.168.255.255", "snmp_port": 6200, "snmp_community": "testing", "snmp_retries": 7, "snmp_timeout": 6200, "alert_oid_label": "oid", "trap_oid_prefix": "1.3.6.1.4.1.50495.15", "trap_default_oid": "1.3.6.1.4.1.50495.15.1.2.1", "trap_default_severity": "", "host": "0.0.0.0", "port": 9099, "metrics": false, "cert": "", "key": ""}
INFO:cherrypy.error:[14/Feb/2023:21:30:56] ENGINE Listening for SIGTERM.
INFO:cherrypy.error:[14/Feb/2023:21:30:56] ENGINE Listening for SIGHUP.
INFO:cherrypy.error:[14/Feb/2023:21:30:56] ENGINE Listening for SIGUSR1.
INFO:cherrypy.error:[14/Feb/2023:21:30:56] ENGINE Bus STARTING
INFO:cherrypy.error:[14/Feb/2023:21:30:56] ENGINE Serving on http://0.0.0.0:9099
INFO:cherrypy.error:[14/Feb/2023:21:30:56] ENGINE Bus STARTED
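
A quick way to cross-check the spec against the "Configuration settings" log line is to compare the two field by field. The sketch below does that in Python using only the values posted above. The pairing of spec fields to webhook keys is an assumption inferred from the names, not taken from the operator source, and `find_mismatches` is a hypothetical helper.

```python
# Spec field (left) -> prometheus-webhook-snmp config key (right).
# ASSUMPTION: this pairing is inferred from the names only.
SPEC_TO_WEBHOOK = {
    "target": "snmp_host",
    "port": "snmp_port",
    "community": "snmp_community",
    "retries": "snmp_retries",
    "timeout": "snmp_timeout",
    "alertOidLabel": "alert_oid_label",
    "trapOidPrefix": "trap_oid_prefix",
    "trapDefaultOid": "trap_default_oid",
    "trapDefaultSeverity": "trap_default_severity",
}


def find_mismatches(spec, webhook_config):
    """Return {spec_field: (spec_value, webhook_value)} for every pair that differs."""
    mismatches = {}
    for spec_field, webhook_key in SPEC_TO_WEBHOOK.items():
        spec_value = spec.get(spec_field)
        webhook_value = webhook_config.get(webhook_key)
        if spec_value != webhook_value:
            mismatches[spec_field] = (spec_value, webhook_value)
    return mismatches


# Copied from the snmpTraps spec posted above.
spec = {
    "alertOidLabel": "oid",
    "community": "testing",
    "port": 6200,
    "retries": 7,
    "target": "192.168.255.255",
    "timeout": 5,
    "trapDefaultOid": "1.3.6.1.4.1.50495.15.1.2.1",
    "trapDefaultSeverity": "",
    "trapOidPrefix": "1.3.6.1.4.1.50495.15",
}

# Copied from the DEBUG "Configuration settings" log line posted above.
webhook_config = {
    "snmp_host": "192.168.255.255",
    "snmp_port": 6200,
    "snmp_community": "testing",
    "snmp_retries": 7,
    "snmp_timeout": 6200,
    "alert_oid_label": "oid",
    "trap_oid_prefix": "1.3.6.1.4.1.50495.15",
    "trap_default_oid": "1.3.6.1.4.1.50495.15.1.2.1",
    "trap_default_severity": "",
}

print(find_mismatches(spec, webhook_config))
```

With the values as posted, every pair matches except timeout, where the spec shows 5 and the log shows 6200 (the same number as the port), so that one field may be worth a second look.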

@leifmadsen
Member Author

Confirmation that the UI displays appropriately per CRD and CR in the bundle:

[image: screenshot of the UI]

@leifmadsen added the enhancement label and removed the do-not-merge and needs-testing labels Feb 14, 2023
@leifmadsen marked this pull request as ready for review February 14, 2023 22:03
@leifmadsen
Member Author

Related changes in prometheus-webhook-snmp have been merged. I have merged master into this branch, resolved the conflicts, and pushed. I will merge to master once the tests pass. I will attempt to trigger the tests; otherwise I will request a manual run.

@leifmadsen
Member Author

test

@leifmadsen
Member Author

Got an approval from Chris a while ago and now Emma has helped get all the tests passing. I am merging this down as I don't expect any further reviews.

@leifmadsen merged commit 16324bc into master Feb 28, 2023
@leifmadsen deleted the stf-559/configurable-snmp-webhook-params branch February 28, 2023 20:27
csibbitt added a commit that referenced this pull request Mar 7, 2023
* Fixes for 17.0 ir script (#380)

* Move the SNMP trap delivery checks (#381)

* Move the SNMP trap delivery checks

Move the SNMP trap delivery checks, as their current position seems
to cause false positives. Moving the checks closer to the end of the
smoketest run seems to give a better chance that the logs the check
is looking for have been provided.

* Use a loop to check for SNMP status with break and max time

* Lock the bundle to OCP v4.10 (#385)

* Make all certs 8yr expiry

* Revert "Make all certs 8yr expiry"

This reverts commit af35714.

* Make all certs 8yr expiry (#387)

* Make all certs 8yr expiry
* Use certificate_duration and test against generated cert
* Better messages during CI cloning

* Expand support for OCP 4.11 (#391)

* Expand support for OCP 4.11

Allow installation to be done on OCP 4.11 while updating the smoketest
jobs to support later versions of the client. Also migrate to using
community-operators CatalogSource instead of OperatorHub.io. Only enable
community-operators when the use_community strategy is enabled.

Update the token request syntax when requesting a service account token.
Add checks to look for oc client version and fail if we're using a
version that's too old.

* Make passwords safer in smoketest job template

Encapsulate the password values with double quotes to help make them
safer for consumption in the template. I had an odd situation where the
password contained a bunch of extended characters and caused the
smoketest to report an error on the template having an issue with yaml
to json.

The password contained several characters such as . and : which confused
the template. Wrapping the contents in the double quotes allowed the
smoketest to apply the job.batch template and result in a working
smoketest run.

* Force SGO checkout during build (#388)

* Replacing the placeholder namespace during the build results in a "there are local changes" error on next build
* This forces the checkout to discard that (and other!?) local changes
* Quicker dev/test loop

* Update oc to 4.11 in jenkins agent (#393)

* Update oc to 4.11 in jenkins agent

Need 4.11 for new token handling changes

* Remove OperatorHub.io as a CatalogSource (#394)

Remove the OperatorHub.io CatalogSource and instead use the
community-operators CatalogSource which is available with an OCP
installation. Ideally this will avoid some of the conflicts we've been
seeing in our CI environment. This is a short term fix as future
development will likely make use of Observability Operator to provide
the metrics data store and alert delivery mechanism.

* Changes for 4.12 (#401)

* Catalog changes
* CI change to pre-clean cert-manager-operator
  * not 100% sure this is 4.12 related, but it's new and first seen during testing 4.12

* Remove Loki from stf-run-ci (#405)

* Remove Loki from stf-run-ci

* Return "Get new operator sdk" to stf-run-ci

* GHA checkout action v2 is deprecated (#407)

The GitHub Actions checkout action v2 is deprecated and needs to move to
version 3.

* Implement SNMPtrap delivery controls (#404)

* Implement SNMPtrap delivery controls

Implement ability to override the default values for the SNMPtrap
alertmanager receiver via prometheus-webhook-snmp component.

Closes: STF-559

* Run operator-sdk generate bundle

Run the following command to update the bundle artifacts:

operator-sdk-0.19.4 generate bundle   --metadata   --manifests   --channels unstable   --default-channel unstable

* Build out the remaining SNMP options

Build out the remaining options for prometheus-webhook-snmp to allow for
finer grained controls and delivery of SNMP traps via alertmanager
alerts.

* Generate bundle contents with operator-sdk

* Implement changes for operator-sdk-1.26.0 testing (#411)

* Implement changes for operator-sdk-1.26.0 testing

Implement changes that allow testing validation via operator-sdk-1.26.0
without bumping the entire bundle generation process from
operator-sdk-0.19.4 to post-operator-sdk-1.x.

These are the same tests run for validation during product pipeline
verification.

* Adds test to verify building of the bundle image works.
* Adds KinD deployment to allow executing scorecard checks.

Related: STF-1252

* Fix properties.yaml

* Simplify use of RELEASE_VERSION variable (#412)

* Add note about why we're copying files in

* Expose ability to set certificate renewal target times (#406)

* Adds duration param for CA and endpoint certs

Replaces certificate_duration with ca_certificate_duration
and endpoint_certificate_duration. Sets the default value for both
to 70080h (the previous value).

Removes the certificate_duration param from the Issuer
resource since it's not actually needed (see [0])

[0] https://cert-manager.io/docs/reference/api-docs/#cert-manager.io/v1.IssuerConfig

* Exposes CA and endpoint certificate duration config

Exposes certificate duration config for both ElasticSearch
and QDR

Keeps the default value in use for now. Better default values
should be discussed to be included in a follow up change.

* Fix indentation for certs duration param in servicetelemetry crd

* Adds cert duration to the OLM catalog

Includes cert duration params in the OLM catalog
for both ElasticSearch and QDR

* Changes snake_case to camelCase to yaml case

Fix to match style convention

* Adds pattern expression for certs duration

* Add certificates param to events and transport

* Exposes duration parameter in the CI script

Adds the duration parameter for both ElasticSearch and QDR
in the CI script

Also updates the OLM Catalog with the latest changes (certificates object)

* Corrects naming to certificates params in CI script

* Fix snake case in the CI script params for cert duration

* Fix indentation for transports in the deploy_stf CI script

---------

Co-authored-by: Chris Sibbitt <csibbitt@redhat.com>

---------

Co-authored-by: Leif Madsen <lmadsen@redhat.com>
Co-authored-by: Jaromír Wysoglad <jwysogla@redhat.com>
Co-authored-by: Victoria Martinez de la Cruz <victoria@redhat.com>
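
The "loop to check for SNMP status with break and max time" idea from #381 in the commit list above can be sketched generically in Python. This is not the smoketest's actual code; `wait_until` and `trap_logged` are hypothetical names, and the check used here is a stand-in for whatever inspects the logs.

```python
import time


def wait_until(check, max_seconds, interval=1.0,
               sleep=time.sleep, clock=time.monotonic):
    """Poll check() until it returns True or max_seconds have elapsed.

    Returns True as soon as check() succeeds (the "break"),
    or False once the deadline has passed (the "max time").
    """
    deadline = clock() + max_seconds
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


# Stand-in check: pretends the expected log line shows up on the third poll.
attempts = []


def trap_logged():
    attempts.append(1)
    return len(attempts) >= 3


found = wait_until(trap_logged, max_seconds=5, interval=0.01)
print(found, len(attempts))
```

Polling with a deadline avoids the false positives and negatives of a single fixed-time check: the loop exits early when the evidence appears and still fails after a bounded wait when it never does.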