Throttling Beats for system stability #17775
Comments
Pinging @elastic/integrations (Team:Integrations)
Has anyone seen this token bucket Go implementation?
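For readers unfamiliar with the pattern, a minimal token-bucket limiter in Go might look like the following. This is an illustrative sketch only (not the implementation linked in the comment above, and not a Beats API); the type and function names are invented for the example:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// TokenBucket admits up to `capacity` events in a burst and refills
// at `rate` tokens per second. One token is consumed per event.
type TokenBucket struct {
	mu       sync.Mutex
	tokens   float64   // currently available tokens
	capacity float64   // maximum burst size
	rate     float64   // tokens added per second
	last     time.Time // time of the last refill
}

func NewTokenBucket(rate, capacity float64) *TokenBucket {
	return &TokenBucket{tokens: capacity, capacity: capacity, rate: rate, last: time.Now()}
}

// Allow reports whether one event may pass now, consuming a token if so.
func (tb *TokenBucket) Allow() bool {
	tb.mu.Lock()
	defer tb.mu.Unlock()
	now := time.Now()
	// Refill proportionally to the elapsed time, capped at capacity.
	tb.tokens += now.Sub(tb.last).Seconds() * tb.rate
	if tb.tokens > tb.capacity {
		tb.tokens = tb.capacity
	}
	tb.last = now
	if tb.tokens >= 1 {
		tb.tokens--
		return true
	}
	return false
}

func main() {
	tb := NewTokenBucket(10, 5) // 10 events/sec sustained, burst of 5
	allowed := 0
	for i := 0; i < 20; i++ {
		if tb.Allow() {
			allowed++
		}
	}
	// In a tight loop, roughly the burst capacity is admitted immediately.
	fmt.Println("allowed:", allowed)
}
```

A production limiter would typically also offer a blocking `Wait` variant so callers can throttle rather than drop events.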
Another use case: in Cloud Foundry you may want to set an event rate limit per organization, or in Kubernetes per namespace.
Beats 7.11 will include a …
Good addition, but throttling would be even better |
@zez3 as I replied to your comment in #22883 (comment):
Another component of this discussion might be event size, as it falls under the category of ingestion controls for pipeline stability. For example, in one instance an organization succeeded by restricting event sizes to 500 KB at Kafka.
The future is now. We will most probably move to the Elastic Stack this year, and we kind of need this throttling implemented.
Any updates to this issue? I would add that the ability to throttle certain Fleet integrations would be handy, as some integrations are more resource-intensive than others. Maybe this would apply to a Fleet agent policy more than to an individual Fleet integration?
Also follow elastic/elastic-agent-shipper#16
I would close this issue now that the shipper is almost functional
Describe the enhancement:
Several users have filed issues requesting the ability to throttle Beats, usually in order to improve system stability and reduce the impact on other applications. Since Beats are monitoring applications, they should not interrupt critical business applications. We'd like to evaluate all of these requests and determine the best plan for implementation.
Describe a specific use case for the enhancement or feature:
There are several types of resources that users are concerned about:
There are several ways to mitigate these issues.
I'm listing them here together because a limit on one may indirectly impose a limit on the others. Thus it may be possible to solve many (but perhaps not all) of these problems with a single solution. There are different ways to implement each of these limits, and pros and cons to each one. We should evaluate each to determine the best solution that will help the most customers.
Why system tools fall short
Historically we have preferred to rely on system tools for rate limiting/QoS because they give operators more control. However, these tools are not accessible to all users, they may be difficult to set up or configure, and operators may need to implement a variety of solutions across heterogeneous systems. Even a simple limit is better than nothing for users who want an out-of-the-box solution.
Currently, our docs give an example of how to configure limits using `tc` and `iptables`. See: https://www.elastic.co/guide/en/beats/filebeat/current/bandwidth-throttling.html#bandwidth-throttling. Also, something that works today is to limit the Beat to a single CPU core via the `max_procs` setting.
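For reference, the CPU cap mentioned above is a one-line setting in the Beat's configuration file (shown here for Filebeat as an example; the setting exists across Beats):

```yaml
# filebeat.yml (example)
# Limit the Beat to a single CPU core; max_procs maps to Go's GOMAXPROCS.
max_procs: 1
```

This caps CPU usage only; it does not bound memory or network bandwidth.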
With systemd one can configure `ExecStartPre` and `ExecStopPost` scripts in the unit files. This allows users to install/remove rules as part of the service startup. Unfortunately systemd has removed the NetClass setting, requiring users to fall back to the `tc` tool. On Linux one can also make use of net_prio + cgroups (e.g. https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/resource_management_guide/sec-prioritizing_network_traffic), but AFAIK this is not easy to integrate with systemd, and it looks like cgroups v2 has not yet settled on the supported cgroup controllers.
Being able to configure the bandwidth outside the Beat also makes it easier for users to adapt the rate based on time or internal policies. The F5 docs (and other network vendors'), for example, describe how to do traffic shaping: https://techdocs.f5.com/en-us/bigip-15-1-0/big-ip-access-policy-manager-network-access/shaping-traffic-on-the-network-access-client.html
Deciding on an approach
There are multiple ways to set limits, each with pros and cons. In general, bandwidth limiting can be static or dynamic, or even a mix of both driven by a predefined schedule. One customer is asking for rate limiting in the application (Beats). This is easy to configure from the user's point of view, but what we can support is limited: do we limit based on the number of events, or on bytes? Currently, for both Beats and Logstash, the unit of work is the event. Until an event reaches the outputs, we cannot tell its actual size in bytes. Applying limits in the output would be possible to some extent, but is also limited: for some outputs we have no control over the network clients and their setup, and without control over the sockets we cannot limit byte usage at all.
Different outputs may need different limits. The Kafka client also does the batching itself, without giving us much control. Even with rate limiting in front of the Kafka client, our limits cannot be accurate, because batching leads to spikes/bursts, especially if the remote service was unavailable for some time. For the other outputs we can create our own connections and measure bandwidth usage, but even then the rate limiting would not take network protocol overhead into account.
Giving network packets dynamic priority has the advantage of a dynamic bound: give other applications higher priority, but if bandwidth is available right now, use it to ingest more data. However, this can only be decided accurately by the OS or the network, not by the Beat. Not having enough bandwidth available at all leads to data loss or, in the case of Filebeat, to Filebeat not closing file descriptors (because not all events have been published yet).
A long time ago we created a proposal for event scheduling in PR #7082. We dropped the proposal/PR because it would not have solved all possible requirements. Due to batching/buffering, bandwidth limiting must be applied explicitly in the outputs or at the network layer. This means we would have to implement bandwidth limiting per output, where the client library allows us to do so. We should either reconsider this approach or identify a better one to solve this issue.