
Honor kube event resyncs to handle missed watch events #22668

Merged

2 commits merged into elastic:master on Dec 18, 2020

Conversation

@vjsamuel (Contributor) commented Nov 19, 2020

Enhancement

What does this PR do?

This PR adds a HonorReSyncs watcher option that allows resyncs to be queued as add events. Deduplication in the autodiscover module ensures that we don't spin up more than one runner.
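For illustration, a minimal sketch of how a caller might opt into the new behaviour, assuming the libbeat kubernetes watcher API shown in the diff further down in this PR. Apart from HonorReSyncs, the field names (SyncTimeout, Namespace), the Pod resource alias, and the trailing constructor arguments are assumptions, not part of this change:

// Sketch only: the new option sits alongside the existing WatchOptions fields.
opts := kubernetes.WatchOptions{
	SyncTimeout:  time.Minute, // assumed name of the resync interval option
	Namespace:    "test",      // assumed namespace filter option
	HonorReSyncs: true,        // option added by this PR
}
// Trailing arguments may differ by version; nil stands in for optional indexers.
watcher, err := kubernetes.NewWatcher(client, &kubernetes.Pod{}, opts, nil)

The downstream deduplication then decides whether a replayed object needs a new runner or is already being handled.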

Why is it important?

This is needed when Beats has been up and running and namespace-level default hints are added afterwards; without a resync, Beats would have to be restarted to pick them up. The resync makes sure that we reconcile every x minutes so that these configurations get added. It also makes sure that any missed add/update events are handled properly.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Author's Checklist

  • [ ]

How to test this PR locally

http:
  enabled: true
  host: 0.0.0.0
  port: 5002

metricbeat.autodiscover:
  providers:
    - type: kubernetes
      scope: cluster
      sync_period: 1m
      kube_config: ${HOME}/.kube/config
      namespace: test
      add_resource_metadata:
        namespace:
          enabled: true
      builders:
        - type: hints

Create the test namespace and a Prometheus pod:

kubectl create ns testresync
kubectl run prometheus --image=prom/prometheus -n testresync

Add annotations at the namespace level

kubectl annotate ns testresync co.elastic.module=prometheus
kubectl annotate ns testresync co.elastic.hosts='${data.host}:9090'

Wait for 1 minute:
curl localhost:5002/dataset?pretty

This should show a runner for the Prometheus metrics endpoint.

kubectl annotate ns testresync co.elastic.hosts='${data.host}:9091' --overwrite

Wait for one more minute; this should change the runner to the (incorrect) port 9091.

Delete all the annotations and this should remove the runner completely.

Prior to this change, adding namespace-level annotations required a restart of Beats to take effect.

Related issues

Use cases

@botelastic bot added the needs_team label (Indicates that the issue/PR needs a Team:* label) on Nov 19, 2020
@elasticmachine (Collaborator) commented Nov 19, 2020

💚 Build Succeeded


Build stats

  • Build Cause: Started by user Chris Mark

  • Start Time: 2020-12-18T19:21:35.951+0000

  • Duration: 55 min 45 sec

Test stats 🧪

Test Results
Failed 0
Passed 17504
Skipped 1383
Total 18887

Steps errors: 2

Terraform Apply on x-pack/metricbeat/module/aws
  • Took 0 min 14 sec
  • Description: terraform apply -auto-approve
Terraform Apply on x-pack/metricbeat/module/aws
  • Took 0 min 15 sec
  • Description: terraform apply -auto-approve

💚 Flaky test report

Tests succeeded.


@andresrc added the Team:Platforms label (Integrations - Platforms team) on Nov 19, 2020
@elasticmachine (Collaborator) commented:

Pinging @elastic/integrations-platforms (Team:Platforms)

@botelastic bot removed the needs_team label on Nov 19, 2020
@vjsamuel force-pushed the handle_kube_resync branch 2 times, most recently from 5013c59 to e733b94, on December 1, 2020 22:38
@vjsamuel (Contributor, Author) commented Dec 2, 2020

The CI failure seems like a flake; I tried running it locally and it passes.

@vjsamuel (Contributor, Author) commented Dec 2, 2020

@jsoriano @ChrsMark can you please help with this one too? It has been outstanding for quite a while and is a critical gap in the namespace annotation fallback support that we added. Without this PR, Beats won't pick up hints on the namespace without being restarted.

@jsoriano (Member) left a comment:

Hey @vjsamuel,

As I commented in the original PR #16322 (review), I think we should explicitly watch changes in namespaces to update pods instead of relying on pod resyncs for that. We plan to do some refactors in this direction, but I don't think they will arrive sooner than 7.12, so let's go on with this PR for now.

Added a couple of small questions.

Thanks!

}
}

// Since all the events belong to the same event ID pick one and add in all the configs
Inline review comment (Member):
Could you please add a check in this method so it reports some error (or even panics) if not all events have the same ID? I am worried that in a future refactor we might be tempted to reuse this method to publish pod and container events in the same call, and they will have different IDs (commented in the old PR #16322 (comment)).
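A minimal sketch of the kind of guard being asked for, assuming each bus.Event carries an "id" key as in the surrounding autodiscover code; the helper name is illustrative and not the change that was eventually made:

// checkSameID is a hypothetical guard: it fails when a batch of events that is
// about to be merged and published does not share a single event ID.
func checkSameID(events []bus.Event) error {
	if len(events) == 0 {
		return nil
	}
	first := events[0]["id"]
	for _, e := range events[1:] {
		if id := e["id"]; id != first {
			return fmt.Errorf("cannot publish events with mixed IDs: %v != %v", first, id)
		}
	}
	return nil
}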

@jsoriano (Member) commented Dec 2, 2020

jenkins run the tests please

@ChrsMark self-requested a review on December 2, 2020 10:12
@ChrsMark added the 'needs testing notes' and 'test-plan' (Add this PR to be manual test plan) labels on Dec 2, 2020
@vjsamuel (Contributor, Author) commented Dec 2, 2020

@jsoriano thanks for the patience on this one :) The problem with not doing something like this, as I mentioned last time, is that reconciliation is a must for Kubernetes control loops: there is no guarantee that every WATCH event that is emitted is handled by the client. That being said, we have to reconcile periodically. This PR primarily tries to cover those kinds of issues, as we have seen cases where something wasn't being monitored and only started working properly once Beats got restarted.
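For context, a minimal client-go sketch of how those periodic resyncs surface (standard informer behaviour, not code from this PR): on every resync the informer replays its cached objects through UpdateFunc with an unchanged resource version, which is exactly the case that HonorReSyncs re-routes into the add queue.

package main

import (
	"fmt"
	"time"

	v1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// Resync every minute: the informer periodically replays its cache through
	// the update handler even when nothing changed on the API server.
	factory := informers.NewSharedInformerFactory(client, time.Minute)
	podInformer := factory.Core().V1().Pods().Informer()

	podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(oldObj, newObj interface{}) {
			o := oldObj.(*v1.Pod)
			n := newObj.(*v1.Pod)
			if o.ResourceVersion == n.ResourceVersion {
				// Resync: nothing actually changed. Treating this like an add
				// lets a consumer reconcile state it may have missed.
				fmt.Println("resync for pod", n.Name)
				return
			}
			fmt.Println("real update for pod", n.Name)
		},
	})

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	cache.WaitForCacheSync(stop, podInformer.HasSynced)
	select {}
}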

event := bus.Event(common.MapStr(events[0]).Clone())
// Remove the port to avoid ambiguity during debugging
delete(event, "port")
event["config"] = configs
Inline review comment (Member):

Wondering if we could avoid this merge and check the configs differently in the autodiscover runner. I feel this will make things more complex and harder to debug in the future. WDYT?

@vjsamuel (Contributor, Author) replied:

The problem is that we wouldn't know whether a config was removed during a resync if we don't compare them together.
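To illustrate the point, a hedged sketch of the comparison that becomes possible once all configs for one autodiscover ID arrive in a single event. The names (reconcile, running, latest) and the use of hashstructure are illustrative, not the autodiscover runner's actual code:

// reconcile compares the full config set produced by a resync against what is
// currently running for the same ID: configs missing from the new set must have
// been removed, so their runners can be stopped.
func reconcile(running map[uint64]*common.Config, latest []*common.Config) (start, stop []*common.Config) {
	seen := map[uint64]bool{}
	for _, cfg := range latest {
		hash, err := hashstructure.Hash(cfg, nil)
		if err != nil {
			continue
		}
		seen[hash] = true
		if _, ok := running[hash]; !ok {
			start = append(start, cfg) // new config introduced by the resync
		}
	}
	for hash, cfg := range running {
		if !seen[hash] {
			stop = append(stop, cfg) // config no longer present: stop its runner
		}
	}
	return start, stop
}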

@ChrsMark (Member) left a comment:

I think we can go on with this for now. What is still missing before merge:

  1. Can we add a comment in the code about the logic behind HonorReSyncs where it is used, to make it easier to read in the future? I left a comment about it inline. Thanks.
  2. @jsoriano's questions/comments are not yet addressed.
  3. Manual testing notes are missing from the PR description. Can you please add them with the cases you have been testing?

@@ -137,6 +139,8 @@ func NewWatcher(client kubernetes.Interface, resource Resource, opts WatchOption
UpdateFunc: func(o, n interface{}) {
if opts.IsUpdated(o, n) {
w.enqueue(n, update)
} else if opts.HonorReSyncs {
w.enqueue(n, add)
Inline review comment (Member):
Can you also please elaborate on why, in this case, we push the event into the add queue and not the update queue? Maybe add a comment about it for the future reader?

@vjsamuel (Contributor, Author) replied:

I have done the needful.
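For the reader following along, a hedged paraphrase of the rationale being asked for, written as the kind of inline comment the diff above could carry (not necessarily the wording that was actually committed):

} else if opts.HonorReSyncs {
	// Periodic resyncs replay objects that have not changed, so IsUpdated is
	// false for them. Enqueueing them as add events lets the autodiscover
	// deduplication confirm runners that already exist and start anything that
	// was missed, e.g. a dropped watch event or namespace hints added after
	// startup.
	w.enqueue(n, add)
}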

Commits added to the PR:
  • Reshuffle pod events and container events
  • Fix test
  • Fix incorrect reconciliation
@vjsamuel (Contributor, Author) commented:

@ChrsMark @jsoriano I have updated the PR with the required changes. I am open to designing this better in follow-up PRs; I can do them personally based on recommendations as well.

@vjsamuel (Contributor, Author) commented:

Test plan added.

@ChrsMark (Member) left a comment:

LGTM

@ChrsMark (Member) commented:

Created an issue to discuss this and draft a solution: #23139

@ChrsMark merged commit 6678a66 into elastic:master on Dec 18, 2020
ChrsMark pushed a commit to ChrsMark/beats that referenced this pull request Dec 18, 2020
@jsoriano (Member) commented Dec 29, 2020

The test added here seems to be flaky; not sure if this is an actual bug. I have opened an issue for investigation: #23319

@jsoriano added the needs_backport label (PR is waiting to be backported to other branches) on Dec 29, 2020
ChrsMark added a commit that referenced this pull request Jan 4, 2021

(cherry picked from commit 6678a66)

Co-authored-by: Vijay Samuel <vjsamuel@ebay.com>
@andresrc added the test-plan-added label (This PR has been added to the test plan) on Feb 15, 2021
@jsoriano removed the needs_backport label on Feb 24, 2021
Labels
  • needs testing notes
  • Team:Platforms (Label for the Integrations - Platforms team)
  • test-plan (Add this PR to be manual test plan)
  • test-plan-added (This PR has been added to the test plan)
  • v7.12.0

5 participants