Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[libbeat] Fix / clarify the module reload logic #16440

Merged
merged 14 commits into from
Feb 26, 2020
Merged

Conversation

faec
Copy link
Contributor

@faec faec commented Feb 19, 2020

What does this PR do?

The loop logic for the module reloader had a bug that would reload module configurations even if they hadn't changed. We suspect that in some cases this could cause unbounded memory consumption (we are still investigating the exact scope of the problem). This PR fixes the error and restructures the surrounding code to clarify the intended logic.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works

How to test this PR locally

With debug output on, any beat with module reloading enabled should output Number of module configs found: ... every time it reloads the configuration. Running a beat without this PR should show that this happens at every reload interval, whereas with it it should only reload when a file has changed.

@faec faec added bug libbeat needs_backport PR is waiting to be backported to other branches. labels Feb 19, 2020
@faec faec requested a review from ruflin February 19, 2020 21:26
@urso
Copy link

urso commented Feb 19, 2020

Fix looks good to me. Maybe we can introduce an unit, integration, or system test that makes the old behavior fail.

Copy link
Contributor

@ruflin ruflin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a changelog entry?

// Make sure the next run also updates because some runners were not properly loaded
overwriteUpdate = true
if err = list.Reload(configs); err != nil {
logp.Err("Error reloading module configurations: %v", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please check if we should really log this on the error level. Could this reload return an error because a harvester was not finished yet? I just want to make sure we don't spam the error logs with this in some cases.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the test runs, the results are real errors, but they're also already logged by Reload itself, so this does end up being noisy and redundant. Removed / commented.

@@ -185,7 +185,11 @@ func (rl *Reloader) Run(runnerFactory RunnerFactory) {
rl.config.Reload.Period = 0
}

overwriteUpdate := true
// If forceReload is set, the configuration should be reloaded
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the renaming of the variable.

@faec faec requested a review from a team as a code owner February 24, 2020 23:08
Copy link
Contributor

@ruflin ruflin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM and ok to merge for me. I really like the tests you added but I'm a bit concerned that they might become flaky. We had many such issues with timing related tests in the past.

continue
}
configReloads.Add(1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wondering if the config reload counter should only increase when it was successful? And have an additional error counter for debugging?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm hesitant, because an error counter here wouldn't be at the right granularity -- we get an error here even if almost everything reloads successfully and there's one invalid file in the search path, and it would be misleading to report that as though the reload didn't happen.

The related monitor vars libbeat.config.module.{starts, stops, running} are defined in this file also but they aren't used at all. I was planning to enable them in a followup (it requires changing the RunnerList API and modifying other call sites or I'd have done it in this PR), but it might make sense to add errors at the same time to track reload failures at module granularity instead of the whole configuration directory.

}
// time interval is slightly more than a second so we don't slightly
// undershoot the first iteration and wait a whole extra second.
time.Sleep(1100 * time.Millisecond)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit worried about tests with Sleep as they tend to get flaky earlier or later. Is there an other way to achieve this "wait"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know a better way, though I'm open to suggestions :-) A spin lock would presumably be even worse. The difficulty is that the behavior being tested is itself timer-based. I did structure the logic here so that the test is only affected by the state changes and not other timing issues -- the only "race condition" here is that the unit test might wait several seconds without the reloader's 1-second-interval timer going off, in which case it has no way to tell whether anything's happening. But, the retry timeout was shorter than I meant it to be, so I just increased it to 10 seconds.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I'm late on this one. Lets see if it shows up in flaky tests in the future ;-)

What we did as workarounds in the past is:

  • Check for log messages
  • Have an other data point we can check very frequently for a change.

In general I think it's problematic if a single unit tests increases the test suite by seconds and only spends the time waiting. To improve the above, perhaps there is a way to decrease the scan frequency and do checks every 10ms but retry 1000 times.

@faec faec requested a review from a team as a code owner February 26, 2020 12:39
@faec
Copy link
Contributor Author

faec commented Feb 26, 2020

I finally isolated the problem with the flaky reloader tests in the CI (test_reload.py in both heartbeat and metricbeat). The tests were waiting on a log message that, with "correct" reloader behavior, should never have happened in the first place (and were thus relying on the previous broken behavior to pass).

Some details for posterity: the tests were waiting for lines like "Starting reload procedure, current runners: 1" to detect that one runner was started, but this is the log line at the beginning of the reload procedure, and so this line will generally not appear unless the configuration is changed after a runner has been started (we will instead see "current runners: 0" followed by "Start list: 1"). When we incorrectly triggered changes every iteration this was ok, and just resulted in a short delay before the target log entry appeared because of a redundant scan, but fixing the reloader prevented this.

On the other hand, the test consistently passed locally, because (as mentioned elsewhere in this PR) our glob watcher rounds down to the next earlier second, so with short timer intervals some changes can be detected twice, and so local test runs were still getting the redundant log line they needed to pass -- however, the slower CI runs often didn't hit the next scan within exactly a second, and in that case would not trigger a redundant change.

I've changed the tests to instead trigger on the less ambiguous log lines that specify how many runners were started / stopped, which is independent of the exact timer intervals and modification times.

@faec faec merged commit 2743d51 into elastic:master Feb 26, 2020
@faec faec deleted the module-reload branch February 27, 2020 17:05
faec added a commit to faec/beats that referenced this pull request Feb 28, 2020
@faec faec added v7.7.0 and removed needs_backport PR is waiting to be backported to other branches. labels Feb 28, 2020
faec added a commit to faec/beats that referenced this pull request Feb 28, 2020
@faec faec added the v7.6.1 label Feb 28, 2020
@jsoriano
Copy link
Member

We suspect that in some cases this could cause unbounded memory consumption (we are still investigating the exact scope of the problem).

@faec do you have more info about the cases where this happens? I know that this is an issue at least when some processors that keep some connections are used at the module or input level. I have some WIP PRs (#16349) to look for a solution for this.

@faec
Copy link
Contributor Author

faec commented Feb 28, 2020

do you have more info about the cases where this happens?

Sort of -- that suspicion was based on a user issue that was seeing apparent memory leaks where the number of changed / loaded configurations steadily grew, which is what led to finding this bug. However, we haven't been able to reproduce the problem, because the loaded configurations (after Reload is called in this PR) are still hashed and compared before actually starting / stopping the respective runners -- this bug caused redundant configuration processing, but we haven't observed any effect on the running modules.

However if there's an additional bug somewhere that makes the hash for a particular loaded configuration unstable, then this bug combined with that one could produce the behavior mentioned. But we don't have any direct evidence that such a bug exists (we're still waiting on more configuration details in the original discussion).

faec added a commit that referenced this pull request Mar 2, 2020
faec added a commit that referenced this pull request Mar 3, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants