Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure PodDisruptionBudgetAtLimit alert is silenced #3020

Merged

Conversation

machadovilaca
Copy link
Member

What this PR does / why we need it:

The Pod Disruption Budget (PDB) prevents pod disruptions for migratable virtual machine images. If the PDB detects pod disruption, then openshift-monitoring sends a PodDisruptionBudgetAtLimit alert every 60 minutes, for every namespace, which includes virtual machine images that use the LiveMigrate eviction strategy. This is causing noise for all OpenShift Virtualization operator users.

Reviewer Checklist

Reviewers are supposed to review the PR for every aspect below one by one. To check an item means the PR is either "OK" or "Not Applicable" in terms of that item. All items are supposed to be checked before merging a PR.

  • PR Message
  • Commit Messages
  • How to test
  • Unit Tests
  • Functional Tests
  • User Documentation
  • Developer Documentation
  • Upgrade Scenario
  • Uninstallation Scenario
  • Backward Compatibility
  • Troubleshooting Friendly

Jira Ticket:

https://issues.redhat.com/browse/CNV-43361
https://issues.redhat.com/browse/CNV-33834

Release note:

Ensure PodDisruptionBudgetAtLimit alert is silenced

@machadovilaca machadovilaca requested a review from tiraboschi July 9, 2024 13:40
@kubevirt-bot kubevirt-bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. dco-signoff: yes Indicates the PR's author has DCO signed all their commits. labels Jul 9, 2024
@kubevirt-bot kubevirt-bot requested review from assafad and nunnatsa July 9, 2024 13:40
@coveralls
Copy link
Collaborator

coveralls commented Jul 9, 2024

Pull Request Test Coverage Report for Build 9973443225

Details

  • 42 of 65 (64.62%) changed or added relevant lines in 1 file are covered.
  • 1 unchanged line in 1 file lost coverage.
  • Overall coverage decreased (-0.2%) to 85.651%

Changes Missing Coverage Covered Lines Changed/Added Lines %
pkg/alertmanager/silences.go 42 65 64.62%
Files with Coverage Reduction New Missed Lines %
controllers/operands/operandHandler.go 1 86.18%
Totals Coverage Status
Change from base Build 9972285232: -0.2%
Covered Lines: 5253
Relevant Lines: 6133

💛 - Coveralls

@machadovilaca machadovilaca force-pushed the silence-pod-disruption-budget-alerts branch from a9d40e6 to 38c79e0 Compare July 9, 2024 13:52
{
APIGroups: stringListToSlice("monitoring.coreos.com"),
Resources: stringListToSlice("alertmanagers"),
Verbs: stringListToSlice("*"),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please avoid using *

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

Field: namespaceSelector,
Namespaces: map[string]cache.Config{
operatorNamespace: {},
"openshift-monitoring": {},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please use a const

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added

func (r *Reconciler) NewAlertmanagerApi() (*alertmanager.Api, error) {
httpClient := http.Client{}
httpClient.Transport = &http.Transport{
TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please avoid InsecureSkipVerify

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i get a tls: failed to verify certificate: x509: certificate signed by unknown authority
do you know who i can avoid that?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems to be working using /var/run/secrets/kubernetes.io/serviceaccount/service-ca.crt

@machadovilaca machadovilaca force-pushed the silence-pod-disruption-budget-alerts branch 2 times, most recently from 758a240 to d34065c Compare July 10, 2024 16:46
@kubevirt-bot kubevirt-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 11, 2024
@machadovilaca machadovilaca force-pushed the silence-pod-disruption-budget-alerts branch from 9640e85 to 54b08c2 Compare July 11, 2024 08:38
@kubevirt-bot kubevirt-bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 11, 2024
@machadovilaca machadovilaca force-pushed the silence-pod-disruption-budget-alerts branch from 54b08c2 to 6f6505b Compare July 11, 2024 16:11
@machadovilaca
Copy link
Member Author

/retest

Comment on lines 62 to 76
func (r *Reconciler) startEventLoop() {
go func() {
for {
r.events <- event.GenericEvent{
Object: &metav1.PartialObjectMetadata{},
}
time.Sleep(periodicity)
}
}()
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tiraboschi
Copy link
Member

/retest

@tiraboschi
Copy link
Member

/override-bot

@hco-bot
Copy link
Collaborator

hco-bot commented Jul 15, 2024

hco-e2e-consecutive-operator-sdk-upgrades-aws lane succeeded.
/override ci/prow/hco-e2e-consecutive-operator-sdk-upgrades-azure
hco-e2e-upgrade-operator-sdk-sno-azure lane succeeded.
/override ci/prow/hco-e2e-upgrade-operator-sdk-sno-aws

@kubevirt-bot
Copy link
Contributor

@hco-bot: Overrode contexts on behalf of hco-bot: ci/prow/hco-e2e-consecutive-operator-sdk-upgrades-azure, ci/prow/hco-e2e-upgrade-operator-sdk-sno-aws

In response to this:

hco-e2e-consecutive-operator-sdk-upgrades-aws lane succeeded.
/override ci/prow/hco-e2e-consecutive-operator-sdk-upgrades-azure
hco-e2e-upgrade-operator-sdk-sno-azure lane succeeded.
/override ci/prow/hco-e2e-upgrade-operator-sdk-sno-aws

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@tiraboschi
Copy link
Member

hco-e2e-upgrade-prev-operator-sdk-aws lane succeeded.
/override ci/prow/hco-e2e-upgrade-prev-operator-sdk-azure

@kubevirt-bot
Copy link
Contributor

@tiraboschi: Overrode contexts on behalf of tiraboschi: ci/prow/hco-e2e-upgrade-prev-operator-sdk-azure

In response to this:

hco-e2e-upgrade-prev-operator-sdk-aws lane succeeded.
/override ci/prow/hco-e2e-upgrade-prev-operator-sdk-azure

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@kubevirt-bot kubevirt-bot added the lgtm Indicates that a PR is ready to be merged. label Jul 15, 2024
@machadovilaca
Copy link
Member Author

@nunnatsa can you please also take a look?

Copy link
Collaborator

@nunnatsa nunnatsa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added several inline comments.

Please add unit tests to pkg/alertmanager/silences.go - you can use the test server from the golang standard library.

Comment on lines 64 to 74
for {
r.events <- event.GenericEvent{
Object: &metav1.PartialObjectMetadata{},
}
time.Sleep(periodicity)
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please consider using the standard library time.Ticker instead.

Copy link
Collaborator

@nunnatsa nunnatsa Jul 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, please consider to have some escape mechanism; e.g.

// I think the first tick is after the duration, so we'll need to do one manually.
r.events <- event.GenericEvent{
	Object: &metav1.PartialObjectMetadata{},
}
for {
	select {
		case <- r.ticker:
			r.events <- event.GenericEvent{
				Object: &metav1.PartialObjectMetadata{},
			}
		case <- something: // (may be a context done channel, or a termination signal channel)
			return
	}
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about dropping this event event source and simply using something like return ctrl.Result{Requeue: true, RequeueAfter: periodicity}, nil in Reconcile?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sound like a hack to me.

@machadovilaca, in a second thought, why we even need this logic as a controller? It can be done as a regular go-routine. The only thing we must consider is to make sure the go-routine is still alive, but this is also true for the current go-routine.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

uhh nice, changed

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, cool, but this whole thing is not k8s resource watching, but a periodic http querying, so I don't think we need the controller mechanism here. It's overkill.

}

func (api *Api) ListSilences() ([]Silence, error) {
req, err := http.NewRequest("GET", fmt.Sprintf("https://%s/api/v2/silences", api.host), nil)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"GET" ==> http.MethodGet

}

var amSilences []Silence
err = json.NewDecoder(resp.Body).Decode(&amSilences)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the decoder closes the body? if not, please add defer resp.Body.Close()

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added

return fmt.Errorf("failed to marshal silence: %w", err)
}

req, err := http.NewRequest("POST", fmt.Sprintf("https://%s/api/v2/silences", api.host), bytes.NewReader(body))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Post" => http.MethodPost

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

}

func (api *Api) DeleteSilence(id string) error {
req, err := http.NewRequest("DELETE", fmt.Sprintf("https://%s/api/v2/silence/%s", api.host, id), nil)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Delete" => http.MethodDelete

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@machadovilaca
Copy link
Member Author

@nunnatsa thank you, all updated

Comment on lines 64 to 74
go func() {
for range ticker.C {
r.events <- event.GenericEvent{
Object: &metav1.PartialObjectMetadata{},
}
}
Copy link
Collaborator

@nunnatsa nunnatsa Jul 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the first tick happens only after the duration. Should we send one event upon starting the ticker?

Suggested change
go func() {
for range ticker.C {
r.events <- event.GenericEvent{
Object: &metav1.PartialObjectMetadata{},
}
}
go func() {
r.events <- event.GenericEvent{
Object: &metav1.PartialObjectMetadata{},
}
for range ticker.C {
r.events <- event.GenericEvent{
Object: &metav1.PartialObjectMetadata{},
}
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

indeed, just tested it and only triggers after the first duration, updated

@nunnatsa nunnatsa dismissed their stale review July 16, 2024 12:23

required unit tests added

@machadovilaca machadovilaca force-pushed the silence-pod-disruption-budget-alerts branch from 1ce9e33 to 0245058 Compare July 16, 2024 13:03
@hco-bot
Copy link
Collaborator

hco-bot commented Jul 17, 2024

hco-e2e-upgrade-operator-sdk-azure lane succeeded.
/override ci/prow/hco-e2e-upgrade-operator-sdk-aws
hco-e2e-upgrade-prev-operator-sdk-azure lane succeeded.
/override ci/prow/hco-e2e-upgrade-prev-operator-sdk-aws
hco-e2e-upgrade-operator-sdk-sno-aws lane succeeded.
/override ci/prow/hco-e2e-upgrade-operator-sdk-sno-azure

@kubevirt-bot
Copy link
Contributor

@hco-bot: Overrode contexts on behalf of hco-bot: ci/prow/hco-e2e-upgrade-operator-sdk-aws, ci/prow/hco-e2e-upgrade-operator-sdk-sno-azure, ci/prow/hco-e2e-upgrade-prev-operator-sdk-aws

In response to this:

hco-e2e-upgrade-operator-sdk-azure lane succeeded.
/override ci/prow/hco-e2e-upgrade-operator-sdk-aws
hco-e2e-upgrade-prev-operator-sdk-azure lane succeeded.
/override ci/prow/hco-e2e-upgrade-prev-operator-sdk-aws
hco-e2e-upgrade-operator-sdk-sno-aws lane succeeded.
/override ci/prow/hco-e2e-upgrade-operator-sdk-sno-azure

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Copy link
Collaborator

@nunnatsa nunnatsa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great job

@kubevirt-bot kubevirt-bot added the lgtm Indicates that a PR is ready to be merged. label Jul 17, 2024
@nunnatsa
Copy link
Collaborator

/retest

@nunnatsa
Copy link
Collaborator

/approve

@kubevirt-bot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: nunnatsa

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kubevirt-bot kubevirt-bot added approved Indicates a PR has been approved by an approver from all required OWNERS files. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Jul 17, 2024
Signed-off-by: machadovilaca <machadovilaca@gmail.com>
Signed-off-by: machadovilaca <machadovilaca@gmail.com>
@machadovilaca machadovilaca force-pushed the silence-pod-disruption-budget-alerts branch from 0245058 to f376283 Compare July 17, 2024 11:33
@kubevirt-bot kubevirt-bot removed lgtm Indicates that a PR is ready to be merged. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Jul 17, 2024
@nunnatsa
Copy link
Collaborator

/lgtm

@kubevirt-bot kubevirt-bot added the lgtm Indicates that a PR is ready to be merged. label Jul 17, 2024
@machadovilaca
Copy link
Member Author

/retest

1 similar comment
@machadovilaca
Copy link
Member Author

/retest

Copy link

openshift-ci bot commented Jul 17, 2024

@machadovilaca: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/hco-e2e-upgrade-prev-operator-sdk-aws f376283 link true /test hco-e2e-upgrade-prev-operator-sdk-aws
ci/prow/hco-e2e-consecutive-operator-sdk-upgrades-azure f376283 link true /test hco-e2e-consecutive-operator-sdk-upgrades-azure

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@nunnatsa
Copy link
Collaborator

/override-bot

@hco-bot
Copy link
Collaborator

hco-bot commented Jul 17, 2024

hco-e2e-upgrade-prev-operator-sdk-azure lane succeeded.
/override ci/prow/hco-e2e-upgrade-prev-operator-sdk-aws

@kubevirt-bot
Copy link
Contributor

@hco-bot: Overrode contexts on behalf of hco-bot: ci/prow/hco-e2e-upgrade-prev-operator-sdk-aws

In response to this:

hco-e2e-upgrade-prev-operator-sdk-azure lane succeeded.
/override ci/prow/hco-e2e-upgrade-prev-operator-sdk-aws

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@nunnatsa
Copy link
Collaborator

hco-e2e-consecutive-operator-sdk-upgrades-aws lane passed

/override ci/prow/hco-e2e-consecutive-operator-sdk-upgrades-azure

@kubevirt-bot
Copy link
Contributor

@nunnatsa: Overrode contexts on behalf of nunnatsa: ci/prow/hco-e2e-consecutive-operator-sdk-upgrades-azure

In response to this:

hco-e2e-consecutive-operator-sdk-upgrades-aws lane passed

/override ci/prow/hco-e2e-consecutive-operator-sdk-upgrades-azure

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@kubevirt-bot kubevirt-bot merged commit 2628100 into kubevirt:main Jul 17, 2024
30 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. dco-signoff: yes Indicates the PR's author has DCO signed all their commits. lgtm Indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants