Thundering herd when starting many Beats at the same time. #4010

Closed
jarpy opened this issue Apr 13, 2017 · 2 comments

Comments

@jarpy
Contributor

jarpy commented Apr 13, 2017

The timing of Beats is so accurate that they create bursty traffic when simultaneously restarted. In a large fleet with good configuration management, it's feasible that many hundreds of Beats could be restarted within one second of each other. They then proceed to stay in perfect sync.

[Image: thundering_beats]

Perhaps an initial startup delay of rand(period) would be nice here?

@andrewkroh
Member

Since the earliest implementation, I have thought that the scheduling of each individual metricset should be staggered at startup to help smooth the CPU load on a host. I hadn't considered the herd effect caused by an entire fleet restarting. The same problem will affect Beats when central monitoring is available and you can reconfigure them all at once.

I think both issues will be addressed if we introduce a random delay into the startup of each metricset. Thanks for providing a visualization of the issue. We can check this again after introducing a fix for this.

@tsg
Contributor

tsg commented Apr 17, 2017

+1 for random delay at startup.

andrewkroh added a commit to andrewkroh/beats that referenced this issue Jun 14, 2017
Add random startup delay to each metricset to avoid the thundering herd problem. Fixes elastic#4010.
@tsg tsg closed this as completed in #4503 Jun 14, 2017
tsg pushed a commit that referenced this issue Jun 14, 2017
Add random startup delay to each metricset to avoid the thundering herd problem. Fixes #4010.