Ensure PodDisruptionBudgetAtLimit alert is silenced #3020
Pull Request Test Coverage Report for Build 9973443225 (Coveralls)
Force-pushed a9d40e6 to 38c79e0.
pkg/components/components.go (outdated)
```go
{
	APIGroups: stringListToSlice("monitoring.coreos.com"),
	Resources: stringListToSlice("alertmanagers"),
	Verbs:     stringListToSlice("*"),
```
Please avoid using `*`.
updated
```go
Field: namespaceSelector,
Namespaces: map[string]cache.Config{
	operatorNamespace:      {},
	"openshift-monitoring": {},
```
Please use a constant.
added
```go
func (r *Reconciler) NewAlertmanagerApi() (*alertmanager.Api, error) {
	httpClient := http.Client{}
	httpClient.Transport = &http.Transport{
		TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
```
Please avoid `InsecureSkipVerify`.
I get `tls: failed to verify certificate: x509: certificate signed by unknown authority`.
Do you know how I can avoid that?
It seems to work when using /var/run/secrets/kubernetes.io/serviceaccount/service-ca.crt as the CA bundle.
Force-pushed 758a240 to d34065c.
Force-pushed 9640e85 to 54b08c2.
Force-pushed 54b08c2 to 6f6505b.
/retest
```go
func (r *Reconciler) startEventLoop() {
	go func() {
		for {
			r.events <- event.GenericEvent{
				Object: &metav1.PartialObjectMetadata{},
			}
			time.Sleep(periodicity)
		}
	}()
}
```
We should understand whether polling every N seconds is acceptable; see also:
/retest
/override-bot
hco-e2e-consecutive-operator-sdk-upgrades-aws lane succeeded.
@hco-bot: Overrode contexts on behalf of hco-bot: ci/prow/hco-e2e-consecutive-operator-sdk-upgrades-azure, ci/prow/hco-e2e-upgrade-operator-sdk-sno-aws In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
hco-e2e-upgrade-prev-operator-sdk-aws lane succeeded.
@tiraboschi: Overrode contexts on behalf of tiraboschi: ci/prow/hco-e2e-upgrade-prev-operator-sdk-azure In response to this:
@nunnatsa can you please also take a look?
Added several inline comments.
Please add unit tests for pkg/alertmanager/silences.go; you can use the test server from the Go standard library (`net/http/httptest`).
```go
for {
	r.events <- event.GenericEvent{
		Object: &metav1.PartialObjectMetadata{},
	}
	time.Sleep(periodicity)
}
```
Please consider using the standard library `time.Ticker` instead.
Also, please consider adding an escape mechanism, e.g.:
```go
// I think the first tick is after the duration, so we'll need to do one manually.
r.events <- event.GenericEvent{
	Object: &metav1.PartialObjectMetadata{},
}
for {
	select {
	case <-r.ticker.C: // r.ticker is a *time.Ticker
		r.events <- event.GenericEvent{
			Object: &metav1.PartialObjectMetadata{},
		}
	case <-ctx.Done(): // may be a context done channel, or a termination signal channel
		return
	}
}
```
What about dropping this event source and simply returning something like `return ctrl.Result{Requeue: true, RequeueAfter: periodicity}, nil` from `Reconcile`?
Sounds like a hack to me.
@machadovilaca, on second thought, why do we even need this logic as a controller? It could be done as a regular goroutine. The only thing we must ensure is that the goroutine stays alive, but that is also true for the current goroutine.
uhh nice, changed
The controller-runtime docs recommend using `RequeueAfter` to poll services that cannot be watched:
https://github.com/kubernetes-sigs/controller-runtime/blob/b33709fbf37f66e069c9607888bb6b7ef107b0c3/designs/cache_options.md?plain=1#L140-L147
OK, cool, but this whole thing is not Kubernetes resource watching but periodic HTTP querying, so I don't think we need the controller mechanism here. It's overkill.
pkg/alertmanager/silences.go (outdated)
```go
}

func (api *Api) ListSilences() ([]Silence, error) {
	req, err := http.NewRequest("GET", fmt.Sprintf("https://%s/api/v2/silences", api.host), nil)
```
Use `http.MethodGet` instead of `"GET"`.
```go
}

var amSilences []Silence
err = json.NewDecoder(resp.Body).Decode(&amSilences)
```
Does the decoder close the body? If not, please add `defer resp.Body.Close()`.
added
pkg/alertmanager/silences.go (outdated)
```go
	return fmt.Errorf("failed to marshal silence: %w", err)
}

req, err := http.NewRequest("POST", fmt.Sprintf("https://%s/api/v2/silences", api.host), bytes.NewReader(body))
```
Use `http.MethodPost` instead of `"POST"`.
done
pkg/alertmanager/silences.go (outdated)
```go
}

func (api *Api) DeleteSilence(id string) error {
	req, err := http.NewRequest("DELETE", fmt.Sprintf("https://%s/api/v2/silence/%s", api.host, id), nil)
```
Use `http.MethodDelete` instead of `"DELETE"`.
done
@nunnatsa thank you, all updated
```go
go func() {
	for range ticker.C {
		r.events <- event.GenericEvent{
			Object: &metav1.PartialObjectMetadata{},
		}
	}
```
I think the first tick happens only after the duration. Should we send one event upon starting the ticker?
Suggested change:
```go
go func() {
	r.events <- event.GenericEvent{
		Object: &metav1.PartialObjectMetadata{},
	}
	for range ticker.C {
		r.events <- event.GenericEvent{
			Object: &metav1.PartialObjectMetadata{},
		}
	}
```
Indeed, I just tested it and it only triggers after the first duration; updated.
Force-pushed 1ce9e33 to 0245058.
hco-e2e-upgrade-operator-sdk-azure lane succeeded.
@hco-bot: Overrode contexts on behalf of hco-bot: ci/prow/hco-e2e-upgrade-operator-sdk-aws, ci/prow/hco-e2e-upgrade-operator-sdk-sno-azure, ci/prow/hco-e2e-upgrade-prev-operator-sdk-aws In response to this:
Great job
/retest
/approve
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: nunnatsa The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
Signed-off-by: machadovilaca <machadovilaca@gmail.com>
Force-pushed 0245058 to f376283.
/lgtm
/retest
/retest
@machadovilaca: The following tests failed, say
/override-bot
hco-e2e-upgrade-prev-operator-sdk-azure lane succeeded.
@hco-bot: Overrode contexts on behalf of hco-bot: ci/prow/hco-e2e-upgrade-prev-operator-sdk-aws In response to this:
hco-e2e-consecutive-operator-sdk-upgrades-aws lane passed
/override ci/prow/hco-e2e-consecutive-operator-sdk-upgrades-azure
@nunnatsa: Overrode contexts on behalf of nunnatsa: ci/prow/hco-e2e-consecutive-operator-sdk-upgrades-azure In response to this:
What this PR does / why we need it:
The Pod Disruption Budget (PDB) prevents disruption of pods for migratable virtual machine images. When such a PDB is at its limit, openshift-monitoring fires a PodDisruptionBudgetAtLimit alert every 60 minutes for every namespace that contains virtual machine images using the LiveMigrate eviction strategy. This creates noise for all OpenShift Virtualization operator users.
Reviewer Checklist
Jira Ticket:
Release note: