Honor kube event resyncs to handle missed watch events #22668
Conversation
a7e4364 to fc82322
💚 Build Succeeded

Test stats 🧪

Test | Results
---|---
Failed | 0
Passed | 17504
Skipped | 1383
Total | 18887
Pinging @elastic/integrations-platforms (Team:Platforms)
5013c59 to e733b94
CI failure seems like a flake; I tried running it locally and it passes.
Hey @vjsamuel,
As I commented in the original PR #16322 (review), I think we should explicitly watch changes in namespaces to update pods instead of relying on pod resyncs for that. We plan to do some refactoring in this direction, but I don't think it will land before 7.12, so let's go ahead with this PR for now.
Added a couple of small questions.
Thanks!
}
}

// Since all the events belong to the same event ID, pick one and add in all the configs
Could you please add a check in this method so it reports some error (or even panics) if not all events have the same ID? I am worried that in a future refactor we might be tempted to reuse this method to publish pod and container events in the same call, and they will have different ids (commented in the old PR #16322 (comment)).
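For illustration, a guard along the lines requested here could look like the following (a minimal sketch, not code from this PR; checkAllEventsShareID is a hypothetical helper, the import path assumes the 7.x module layout, and events are assumed to carry their ID under the "id" key):

```go
package autodiscover

import (
	"fmt"

	"github.com/elastic/beats/v7/libbeat/common/bus"
)

// checkAllEventsShareID is a hypothetical guard: before the configs of a
// batch of events are merged into a single event, it verifies that every
// event carries the same "id" value, and returns an error otherwise.
func checkAllEventsShareID(events []bus.Event) error {
	if len(events) == 0 {
		return nil
	}
	first, ok := events[0]["id"]
	if !ok {
		return fmt.Errorf("first event carries no id")
	}
	for i, e := range events[1:] {
		if id, ok := e["id"]; !ok || id != first {
			return fmt.Errorf("event %d has id %v, expected %v", i+1, id, first)
		}
	}
	return nil
}
```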
jenkins run the tests please
@jsoriano thanks for the patience on this one :) The problem with not doing something like this, as I mentioned last time, is that reconciliation is a must for Kubernetes control loops. There is no guarantee that every WATCH event that is emitted is handled by the client, so we have to reconcile periodically. This PR primarily tries to cover those kinds of issues; we have seen cases where something wasn't being monitored and only started working properly once Beats was restarted.
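For context, client-go informers already provide the periodic reconciliation hook this PR builds on: on every resync period the informer replays its local cache and calls UpdateFunc even for objects that did not change. A standalone client-go sketch of that behavior, independent of the Beats watcher (the 10-minute period and the handler body are illustrative assumptions):

```go
package main

import (
	"log"
	"reflect"
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumes a kubeconfig at the default location.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// With a 10 minute resync period the informer periodically replays its
	// local cache, invoking UpdateFunc even for objects that did not change.
	factory := informers.NewSharedInformerFactory(client, 10*time.Minute)
	podInformer := factory.Core().V1().Pods().Informer()

	podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(oldObj, newObj interface{}) {
			if reflect.DeepEqual(oldObj, newObj) {
				// Resync replay: no real change, but a chance to reconcile
				// state left stale by a watch event the client never saw.
				log.Printf("resync for %T", newObj)
			}
		},
	})

	stop := make(chan struct{})
	factory.Start(stop)
	factory.WaitForCacheSync(stop)
	select {} // keep watching until the process is killed
}
```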
event := bus.Event(common.MapStr(events[0]).Clone())
// Remove the port to avoid ambiguity during debugging
delete(event, "port")
event["config"] = configs
Wondering if we could avoid this merge and check the configs differently in the autodiscover runner. I feel like this will make things more complex and harder to debug in the future. wdyt?
The problem is that we wouldn't know whether a config was removed during a resync if we don't compare them together.
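In other words, publishing the full set of configs for an event ID lets the runner diff it against what is currently running and notice removals. A hypothetical sketch of that comparison (not the actual autodiscover code; identifying configs by hash is an assumption):

```go
package autodiscover

// diffConfigs is a hypothetical helper: given the hashes of the configs that
// are currently running for an event ID and the hashes published by the
// latest (re)sync, it reports which configs to start and which to stop.
func diffConfigs(running, desired map[uint64]struct{}) (toStart, toStop []uint64) {
	for h := range desired {
		if _, ok := running[h]; !ok {
			toStart = append(toStart, h)
		}
	}
	for h := range running {
		if _, ok := desired[h]; !ok {
			// Present before but missing from the resynced event: the
			// annotation/hint was removed, so the runner should be stopped.
			toStop = append(toStop, h)
		}
	}
	return toStart, toStop
}
```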
I think we can go on with this for now. What is still missing before merge:
- Can we comment about the logic behind HonorReSyncs within the code where it is used, to make it easier to read in the future? I left a comment about it inline. Thanks.
- @jsoriano's questions/comments are not addressed.
- Manual testing notes are missing from the PR description. Can you please add this with the cases you have been testing?
@@ -137,6 +139,8 @@ func NewWatcher(client kubernetes.Interface, resource Resource, opts WatchOption
		UpdateFunc: func(o, n interface{}) {
			if opts.IsUpdated(o, n) {
				w.enqueue(n, update)
			} else if opts.HonorReSyncs {
				w.enqueue(n, add)
Can you also please elaborate why in this case we push the event into the add queue and not in the update one? Maybe add a comment about it for the future reader?
I have done the needful.
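For readers following the thread, the explanation requested above boils down to something like this commented version of the hunk shown earlier (illustrative only; the wording of the comment in the merged code may differ):

```go
UpdateFunc: func(o, n interface{}) {
	if opts.IsUpdated(o, n) {
		w.enqueue(n, update)
	} else if opts.HonorReSyncs {
		// Periodic resyncs replay objects that did not actually change.
		// Enqueueing them as add (rather than update) events lets the
		// autodiscover layer re-emit the full configuration for the object;
		// deduplication there prevents duplicate runners, while any add
		// event that was missed earlier finally gets applied.
		w.enqueue(n, add)
	}
},
```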
Reshuffle pod events and container events; fix test; fix incorrect reconciliation
e733b94 to 8d585ec
Test plan added.
lgtm
(cherry picked from commit 6678a66)
The test added here seems to be flaky; not sure if this is an actual bug. I have opened an issue for investigation: #23319
Enhancement
What does this PR do?
This PR adds a HonorReSyncs watcher option that allows resyncs to be queued as add events. Deduplication in the autodiscover module ensures that we don't spin up more than one runner.
Why is it important?
This is needed when Beats has been up and running and namespace-level default hints are added afterwards. Without a resync, Beats would have to be restarted to pick them up. The resync makes sure that every x minutes we reconcile and ensure these configurations are added. It also makes sure that any missed add/update events are handled properly.
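As a rough illustration of how a caller would opt in (a sketch only; HonorReSyncs comes from this PR, the other WatchOptions fields and the NewWatcher signature follow the 7.x libbeat kubernetes package and may differ between versions, and the 10-minute sync timeout is just an example value):

```go
package example

import (
	"time"

	"github.com/elastic/beats/v7/libbeat/common/kubernetes"
	k8sclient "k8s.io/client-go/kubernetes"
)

// newPodWatcher sketches creating a pod watcher with resync handling enabled,
// so that periodic resyncs are requeued as add events (hypothetical wrapper).
func newPodWatcher(client k8sclient.Interface) (kubernetes.Watcher, error) {
	return kubernetes.NewWatcher(client, &kubernetes.Pod{}, kubernetes.WatchOptions{
		SyncTimeout:  10 * time.Minute, // illustrative resync/sync period
		Namespace:    "",               // all namespaces
		HonorReSyncs: true,             // requeue resync events as adds
	}, nil)
}
```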
Checklist
- I have made corresponding changes to the documentation
- I have made corresponding changes to the default configuration files
- I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.
Author's Checklist
How to test this PR locally
Add annotations at the namespace level
wait for 1 minute:
curl localhost:5002/dataset?pretty
this should show a runner for the prometheus metrics endpoint
kubectl annotate ns testresync co.elastic.hosts='${data.host}:9091' --overwrite
wait for 1 more minute and this should change the runner to the incorrect port 9091
delete all the annotations and this should remove the runner completely.
Prior to this change, adding namespace-level annotations required restarting Beats to take effect.
Related issues
Use cases