Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cherry-pick #26126 to 7.x: Fix startup with failing configuration #26262

Merged
merged 8 commits into from
Jun 11, 2021

Conversation

michalpristas
Copy link
Contributor

@michalpristas michalpristas commented Jun 11, 2021

Cherry-pick of PR #17391 to 7.x branch. Original message:

What does this PR do?

What is going on is a bit weird race.
Basically we start ok but troubles come on restart.

On restart we apply stored config to start MB (mb1) with failing configuration,
MB does not exit just reports Failed state because it cannot apply system/load on windows.
At this time 10 second timer to recover is started otherwise we plan a restart.

Then we pull config from fleet and decide MB should be started.
We start MB mb2 while mb1 is still running. We are starting it because check checks for terminal statuses and Failed is one of them.
This mb2 starts reports running, timer for killing mb1 is stopped. Because MB is running and it does not differentiate between processes.
mb2 fails on data.path conflict with mb1

watcher detects stopped metricbeat and mb3 is started, mb1 is still running, we dont have any information about mb1 anymore anywhere.
mb3 fails on same thing mb2 failed and exits.

mb1 still running in Failed state and we try mbX over and over again.

This PR adds some closers to watcher, some checks on Failure termination and passing proc so watcher and terminator are killing the process they were designed to kill.

While this fix works, i dont like the whole approach where we handle start/stop/restart from 4-5 places and they can be conflicting. I would like to see this redesigned. But i've spent 5 days chasing this and need some social distancing from the topic so this fix is OK at the moment for me.

How to test:

  • Build
  • Create config with fleet-server, system and endpoint integration
  • Deploy on windows machine
  • Wait for everything to settle and restart a lot of times to check if you see 3 metricbeat processes running in Task Manager.

How it behaved before the fix: we had 3 metricbeats, 2 with stable PID (one for monitoring, one is mb1) and then thirds MB process kept changing PID (crash-restart loop)

Why is it important?

Fixes #25829

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

@michalpristas michalpristas self-assigned this Jun 11, 2021
@elasticmachine
Copy link
Collaborator

Pinging @elastic/agent (Team:Agent)

@botelastic botelastic bot added needs_team Indicates that the issue/PR needs a Team:* label and removed needs_team Indicates that the issue/PR needs a Team:* label labels Jun 11, 2021
@elasticmachine
Copy link
Collaborator

💚 Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS
Pipeline View Test View Changes Artifacts preview

Expand to view the summary

Build stats

  • Build Cause: Pull request #26262 opened

  • Start Time: 2021-06-11T15:22:40.900+0000

  • Duration: 86 min 39 sec

  • Commit: 4ca5651

Test stats 🧪

Test Results
Failed 0
Passed 6860
Skipped 24
Total 6884

Trends 🧪

Image of Build Times

Image of Tests

💚 Flaky test report

Tests succeeded.

Expand to view the summary

Test stats 🧪

Test Results
Failed 0
Passed 6860
Skipped 24
Total 6884

@michalpristas michalpristas merged commit e73b271 into elastic:7.x Jun 11, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants