Honor kube event resyncs to handle missed watch events #22668
Conversation
a7e4364 to fc82322
💚 Build Succeeded

Test stats 🧪

Test | Results
---|---
Failed | 0
Passed | 17504
Skipped | 1383
Total | 18887
Pinging @elastic/integrations-platforms (Team:Platforms)
5013c59 to e733b94
CI failure seems like a flake; I tried running it locally and it passes.
Hey @vjsamuel,
As I commented in the original PR #16322 (review), I think we should explicitly watch changes in namespaces to update pods instead of relying on pod resyncs for that. We plan to do some refactoring in this direction, but I don't think it will land before 7.12, so let's go ahead with this PR for now.
Added a couple of small questions.
Thanks!
}
}

// Since all the events belong to the same event ID, pick one and add in all the configs
Could you please add a check in this method so it reports some error (or even panics) if not all events have the same ID? I am worried that in a future refactor we might be tempted to reuse this method to publish pod and container events in the same call, and they will have different ids (commented in the old PR #16322 (comment)).
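For illustration, a guard along the lines requested here could look like the following (a minimal sketch, not code from this PR; checkAllEventsShareID is a hypothetical helper, the import path assumes the 7.x module layout, and events are assumed to carry their ID under the "id" key):

```go
package autodiscover

import (
	"fmt"

	"github.com/elastic/beats/v7/libbeat/common/bus"
)

// checkAllEventsShareID is a hypothetical guard: before the configs of a
// batch of events are merged into a single event, it verifies that every
// event carries the same "id" value, and returns an error otherwise.
func checkAllEventsShareID(events []bus.Event) error {
	if len(events) == 0 {
		return nil
	}
	first, ok := events[0]["id"]
	if !ok {
		return fmt.Errorf("first event carries no id")
	}
	for i, e := range events[1:] {
		if id, ok := e["id"]; !ok || id != first {
			return fmt.Errorf("event %d has id %v, expected %v", i+1, id, first)
		}
	}
	return nil
}
```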
jenkins run the tests please
@jsoriano thanks for the patience on this one :) The problem with not doing something like this, as I mentioned last time, is that reconciliation is a must for Kubernetes control loops. There is no guarantee that every WATCH event that is emitted is handled by the client, so we have to reconcile periodically. This PR primarily tries to cover those kinds of issues; we have seen cases where something wasn't being monitored and only started working properly once Beats was restarted.
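For context, client-go informers already provide the periodic reconciliation hook this PR builds on: on every resync period the informer replays its local cache and calls UpdateFunc even for objects that did not change. A standalone client-go sketch of that behavior, independent of the Beats watcher (the 10-minute period and the handler body are illustrative assumptions):

```go
package main

import (
	"log"
	"reflect"
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumes a kubeconfig at the default location.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// With a 10 minute resync period the informer periodically replays its
	// local cache, invoking UpdateFunc even for objects that did not change.
	factory := informers.NewSharedInformerFactory(client, 10*time.Minute)
	podInformer := factory.Core().V1().Pods().Informer()

	podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(oldObj, newObj interface{}) {
			if reflect.DeepEqual(oldObj, newObj) {
				// Resync replay: no real change, but a chance to reconcile
				// state left stale by a watch event the client never saw.
				log.Printf("resync for %T", newObj)
			}
		},
	})

	stop := make(chan struct{})
	factory.Start(stop)
	factory.WaitForCacheSync(stop)
	select {} // keep watching until the process is killed
}
```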
event := bus.Event(common.MapStr(events[0]).Clone())
// Remove the port to avoid ambiguity during debugging
delete(event, "port")
event["config"] = configs
Wondering if we could avoid this merge and check the configs differently in the autodiscover runner. I feel like this will make things more complex and harder to debug in the future. wdyt?
The problem is that we wouldn't know whether a config was removed during a resync if we don't compare them together.
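In other words, publishing the full set of configs for an event ID lets the runner diff it against what is currently running and notice removals. A hypothetical sketch of that comparison (not the actual autodiscover code; identifying configs by hash is an assumption):

```go
package autodiscover

// diffConfigs is a hypothetical helper: given the hashes of the configs that
// are currently running for an event ID and the hashes published by the
// latest (re)sync, it reports which configs to start and which to stop.
func diffConfigs(running, desired map[uint64]struct{}) (toStart, toStop []uint64) {
	for h := range desired {
		if _, ok := running[h]; !ok {
			toStart = append(toStart, h)
		}
	}
	for h := range running {
		if _, ok := desired[h]; !ok {
			// Present before but missing from the resynced event: the
			// annotation/hint was removed, so the runner should be stopped.
			toStop = append(toStop, h)
		}
	}
	return toStart, toStop
}
```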
I think we can go on with this for now. What is still missing before merge:
- Can we comment about the logic behind HonorReSyncs within the code where it is used, to make it easier to read in the future? I left a comment about it inline. Thanks.
- @jsoriano's questions/comments are not addressed.
- Manual testing notes are missing from the PR description. Can you please add this with the cases you have been testing?
@@ -137,6 +139,8 @@ func NewWatcher(client kubernetes.Interface, resource Resource, opts WatchOption
		UpdateFunc: func(o, n interface{}) {
			if opts.IsUpdated(o, n) {
				w.enqueue(n, update)
			} else if opts.HonorReSyncs {
				w.enqueue(n, add)
Can you also please elaborate why in this case we push the event into the add queue and not in the update one? Maybe add a comment about it for the future reader?
I have done the needful.
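For readers following the thread, the explanation requested above boils down to something like this commented version of the hunk shown earlier (illustrative only; the wording of the comment in the merged code may differ):

```go
UpdateFunc: func(o, n interface{}) {
	if opts.IsUpdated(o, n) {
		w.enqueue(n, update)
	} else if opts.HonorReSyncs {
		// Periodic resyncs replay objects that did not actually change.
		// Enqueueing them as add (rather than update) events lets the
		// autodiscover layer re-emit the full configuration for the object;
		// deduplication there prevents duplicate runners, while any add
		// event that was missed earlier finally gets applied.
		w.enqueue(n, add)
	}
},
```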
Reshuffle pod events and container events; fix test; fix incorrect reconciliation
e733b94 to 8d585ec
Test plan added.
lgtm
(cherry picked from commit 6678a66)
The test added here seems to be flaky; not sure if this is an actual bug. I have opened an issue for investigation: #23319
Enhancement
What does this PR do?
This PR adds a HonorReSyncs watcher option that allows resyncs to be queued as add events. Deduplication in the autodiscover module ensures that we don't spin up more than one runner.
Why is it important?
This is needed when Beats has been up and running and namespace-level default hints are added afterwards. Without a resync, Beats would have to be restarted to pick them up. The resync makes sure that every x minutes we reconcile and ensure these configurations are added. It also makes sure that any missed add/update events are handled properly.
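As a rough illustration of how a caller would opt in (a sketch only; HonorReSyncs comes from this PR, the other WatchOptions fields and the NewWatcher signature follow the 7.x libbeat kubernetes package and may differ between versions, and the 10-minute sync timeout is just an example value):

```go
package example

import (
	"time"

	"github.com/elastic/beats/v7/libbeat/common/kubernetes"
	k8sclient "k8s.io/client-go/kubernetes"
)

// newPodWatcher sketches creating a pod watcher with resync handling enabled,
// so that periodic resyncs are requeued as add events (hypothetical wrapper).
func newPodWatcher(client k8sclient.Interface) (kubernetes.Watcher, error) {
	return kubernetes.NewWatcher(client, &kubernetes.Pod{}, kubernetes.WatchOptions{
		SyncTimeout:  10 * time.Minute, // illustrative resync/sync period
		Namespace:    "",               // all namespaces
		HonorReSyncs: true,             // requeue resync events as adds
	}, nil)
}
```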
Checklist
- I have made corresponding changes to the documentation
- I have made corresponding changes to the default configuration files
- I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.
Author's Checklist
How to test this PR locally
Add annotations at the namespace level
wait for 1 minute:
curl localhost:5002/dataset?pretty
this should show a runner for the prometheus metrics endpoint
kubectl annotate ns testresync co.elastic.hosts='${data.host}:9091' --overwrite
wait for 1 more minute and this should change the runner to the incorrect port 9091
delete all the annotations and this should remove the runner completely.
Prior to this change, adding namespace-level annotations required restarting Beats to take effect.
Related issues
Use cases