Refactor docker watcher to fix flaky test and other small issues #21851

Merged
jsoriano merged 9 commits into elastic:master from fix-flaky-test-watcher-die on Oct 16, 2020

Conversation

jsoriano
Member

@jsoriano jsoriano commented Oct 15, 2020

What does this PR do?

Refactor docker watcher to fix some small issues and improve testability:

  • Actually release the resources of previous connections when reconnecting.
  • The watcher now uses a clock that can be mocked in tests of time-sensitive functionality.
  • Use nanosecond precision from event timestamps; this is important to avoid duplicated events on reconnection.
  • Fix logger initialization (it was being initialized as docker.docker).
  • Refactor test helpers to give more control over the test watcher when needed.
  • Some other code refactors.

Why is it important?

  • Fixes the flaky test tracked in "[Libbeat] Flaky TestWatcherDie test" (#7906).
  • The watch loop relied on deferred calls for cleanup, but the loop never returned, so the resources associated with each reconnection were never released. On reconnections triggered by not receiving events, the previous connection was not closed, so several connections were kept alive even though all but the last one were ignored.
  • Fixes duplicated events on reconnection. Each reconnection requests the events since the last one received. With second granularity, every reconnection retrieved again all the events that happened during the second in which the last event was received, including that event itself. The API supports nanosecond granularity, which avoids this problem.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Related issues

$ go test -count 10000 ./libbeat/common/docker/
ok  	github.com/elastic/beats/v7/libbeat/common/docker	5.220s

@jsoriano jsoriano added the enhancement, flaky-test, and Team:Platforms labels Oct 15, 2020
@jsoriano jsoriano self-assigned this Oct 15, 2020
@elasticmachine
Collaborator

Pinging @elastic/integrations-platforms (Team:Platforms)

@botelastic botelastic bot added the needs_team label Oct 15, 2020
@jsoriano jsoriano added the needs_backport and v7.11.0 labels Oct 15, 2020
@botelastic botelastic bot removed the needs_team label Oct 15, 2020
@jsoriano jsoriano added the bug and needs_team labels and removed the enhancement and needs_team labels Oct 15, 2020
Comment on lines +292 to 295
if time.Since(lastReceivedEventTime) > dockerEventsWatchPityTimerTimeout {
w.log.Infof("No events received within %s, restarting watch call", dockerEventsWatchPityTimerTimeout)
return false
}
Member Author

Not sure if this reconnection is needed; leaving it just in case. I guess it is there to handle connections that stall for some reason.

Member

👍

@elasticmachine
Collaborator

elasticmachine commented Oct 15, 2020

💚 Build Succeeded

Build stats

  • Build Cause: [Pull request #21851 updated]

  • Start Time: 2020-10-16T13:23:07.962+0000

  • Duration: 79 min 50 sec

Test stats 🧪

Test Results
Failed 0
Passed 16349
Skipped 1343
Total 17692

Member

@ChrsMark ChrsMark left a comment

Looks really good. Just left some comments/questions.

libbeat/common/docker/watcher.go (outdated, resolved)
libbeat/common/docker/watcher_test.go (resolved)
Member

@ChrsMark ChrsMark left a comment

lgtm

just one non-blocking suggestion

@jsoriano jsoriano merged commit 4427fa5 into elastic:master Oct 16, 2020
@jsoriano jsoriano deleted the fix-flaky-test-watcher-die branch October 16, 2020 15:54
jsoriano added a commit to jsoriano/beats that referenced this pull request Oct 16, 2020
…stic#21851)

Refactor docker watcher to fix some small issues and improve testability:
* Actually release resources of previous connections when reconnecting.
* Watcher uses a clock that can be mocked in tests for time-sensitive functionality.
* Use nanoseconds-precision from events timestamps, this is important to avoid duplicated events on reconnection.
* Fix logger initialization (it was being initialized as docker.docker).
* Refactor test helpers to have more control on test watcher when needed.
* Some other code refactors.

(cherry picked from commit 4427fa5)
@jsoriano jsoriano removed the needs_backport label Oct 16, 2020
jsoriano added a commit that referenced this pull request Oct 18, 2020
) (#21918)

v1v added a commit to v1v/beats that referenced this pull request Oct 19, 2020
* upstream/master: (23 commits)
  [Ingest Manager] Prevent reporting ecs version twice (elastic#21616)
  [CI] Use google storage to keep artifacts (elastic#21910)
  Update docs.asciidoc (elastic#21849)
  Kubernetes leaderelection improvements (elastic#21896)
  Apply name changes to elastic agent docs (elastic#21549)
  Add 7.7.1 relnotes to 7.8 docs (elastic#21937) (elastic#21941)
  [libbeat] Fix potential deadlock in the disk queue + add more unit tests (elastic#21930)
  Refactor docker watcher to fix flaky test and other small issues (elastic#21851)
  [CI] Add stage name in the step (elastic#21887)
  [docs] Remove extra word in autodiscover docs (elastic#21871)
  [CI] lint stage doesn't produce test reports (elastic#21888)
  Add tests of reader of filestream input (elastic#21814)
  [Ingest Manager] Use local temp instead of system one (elastic#21883)
  chore: delegate variant pushes to the right method (elastic#21861)
  [CI] kind setup fails sometimes (elastic#21857)
  Fix panic on add_docker_metadata close (elastic#21882)
  Add tests for fileProspector in filestream input (elastic#21712)
  [Filebeat][okta] Fix okta pagination (elastic#21797)
  Add cloud.account.id into add_cloud_metadata for gcp (elastic#21776)
  Fix syslog RFC 5424 parsing in CheckPoint module (elastic#21854)
  ...
v1v added a commit to v1v/beats that referenced this pull request Oct 19, 2020
…laky-test-analyser

* upstream/master: (22 commits)
  [Ingest Manager] Prevent reporting ecs version twice (elastic#21616)
  [CI] Use google storage to keep artifacts (elastic#21910)
  Update docs.asciidoc (elastic#21849)
  Kubernetes leaderelection improvements (elastic#21896)
  Apply name changes to elastic agent docs (elastic#21549)
  Add 7.7.1 relnotes to 7.8 docs (elastic#21937) (elastic#21941)
  [libbeat] Fix potential deadlock in the disk queue + add more unit tests (elastic#21930)
  Refactor docker watcher to fix flaky test and other small issues (elastic#21851)
  [CI] Add stage name in the step (elastic#21887)
  [docs] Remove extra word in autodiscover docs (elastic#21871)
  [CI] lint stage doesn't produce test reports (elastic#21888)
  Add tests of reader of filestream input (elastic#21814)
  [Ingest Manager] Use local temp instead of system one (elastic#21883)
  chore: delegate variant pushes to the right method (elastic#21861)
  [CI] kind setup fails sometimes (elastic#21857)
  Fix panic on add_docker_metadata close (elastic#21882)
  Add tests for fileProspector in filestream input (elastic#21712)
  [Filebeat][okta] Fix okta pagination (elastic#21797)
  Add cloud.account.id into add_cloud_metadata for gcp (elastic#21776)
  Fix syslog RFC 5424 parsing in CheckPoint module (elastic#21854)
  ...
v1v added a commit to v1v/beats that referenced this pull request Oct 20, 2020
…-store-in-another-location-too

* upstream/master:
  [Ingest Manager] Prevent reporting ecs version twice (elastic#21616)
  [CI] Use google storage to keep artifacts (elastic#21910)
  Update docs.asciidoc (elastic#21849)
  Kubernetes leaderelection improvements (elastic#21896)
  Apply name changes to elastic agent docs (elastic#21549)
  Add 7.7.1 relnotes to 7.8 docs (elastic#21937) (elastic#21941)
  [libbeat] Fix potential deadlock in the disk queue + add more unit tests (elastic#21930)
  Refactor docker watcher to fix flaky test and other small issues (elastic#21851)
  [CI] Add stage name in the step (elastic#21887)
  [docs] Remove extra word in autodiscover docs (elastic#21871)
  [CI] lint stage doesn't produce test reports (elastic#21888)
  Add tests of reader of filestream input (elastic#21814)
  [Ingest Manager] Use local temp instead of system one (elastic#21883)
Labels
bug, flaky-test, Team:Platforms, v7.11.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Libbeat] Flaky TestWatcherDie test
3 participants