Throttling Beats for system stability #17775
Comments
Pinging @elastic/integrations (Team:Integrations)
Has anyone seen this token bucket Go implementation?
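For readers unfamiliar with the pattern, a minimal token-bucket limiter in Go might look like the following. This is an illustrative sketch only (not the implementation linked in the comment above, and not a Beats API); the type and function names are invented for the example:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// TokenBucket admits up to `capacity` events in a burst and refills
// at `rate` tokens per second. One token is consumed per event.
type TokenBucket struct {
	mu       sync.Mutex
	tokens   float64   // currently available tokens
	capacity float64   // maximum burst size
	rate     float64   // tokens added per second
	last     time.Time // time of the last refill
}

func NewTokenBucket(rate, capacity float64) *TokenBucket {
	return &TokenBucket{tokens: capacity, capacity: capacity, rate: rate, last: time.Now()}
}

// Allow reports whether one event may pass now, consuming a token if so.
func (tb *TokenBucket) Allow() bool {
	tb.mu.Lock()
	defer tb.mu.Unlock()
	now := time.Now()
	// Refill proportionally to the elapsed time, capped at capacity.
	tb.tokens += now.Sub(tb.last).Seconds() * tb.rate
	if tb.tokens > tb.capacity {
		tb.tokens = tb.capacity
	}
	tb.last = now
	if tb.tokens >= 1 {
		tb.tokens--
		return true
	}
	return false
}

func main() {
	tb := NewTokenBucket(10, 5) // 10 events/sec sustained, burst of 5
	allowed := 0
	for i := 0; i < 20; i++ {
		if tb.Allow() {
			allowed++
		}
	}
	// In a tight loop, roughly the burst capacity is admitted immediately.
	fmt.Println("allowed:", allowed)
}
```

A production limiter would typically also offer a blocking `Wait` variant so callers can throttle rather than drop events.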
Another use case: in Cloud Foundry you may want to set an event rate limit per organization, or in Kubernetes per namespace.
Beats 7.11 will include a …
Good addition, but throttling would be even better |
@zez3 as I replied to your comment in #22883 (comment):
Another component of this discussion might be event size, as it falls under the category of ingestion controls for pipeline stability. For example, in one instance an organization succeeded by restricting event sizes to 500 KB at Kafka.
The future is now. We will most probably move to the Elastic Stack this year, and we kind of need this throttling implemented.
Any updates to this issue? I would add that the ability to throttle certain Fleet integrations would be handy, as some integrations are more resource-intensive than others. Maybe this would apply to a Fleet agent policy more than to an individual Fleet integration?
Also follow elastic/elastic-agent-shipper#16
I would close this issue now that the shipper is almost functional
Describe the enhancement:
Several users have filed issues requesting the ability to throttle Beats, usually in order to improve system stability and reduce the impact on other applications. Since Beats are monitoring applications, they should not interrupt critical business applications. We'd like to evaluate all of these requests and determine the best plan for implementation.
Describe a specific use case for the enhancement or feature:
There are several types of resources that users are concerned about:
There are several ways to mitigate these issues.
I'm listing them here together because a limit on one may indirectly impose a limit on the others. Thus it may be possible to solve many (but perhaps not all) of these problems with a single solution. There are different ways to implement each of these limits, and pros and cons to each one. We should evaluate each to determine the best solution that will help the most customers.
Why system tools fall short
Historically we have preferred to rely on system tools for rate limiting/QoS because they give operators more control. However, these tools are not accessible to all users, they may be difficult to set up or configure, and operators may need to implement a variety of solutions across heterogeneous systems. Even a simple limit is better than nothing for users who want an out-of-the-box solution.
Currently, our docs give an example of how to configure limits using `tc` and `iptables`. See: https://www.elastic.co/guide/en/beats/filebeat/current/bandwidth-throttling.html#bandwidth-throttling. Also, something that works today is to limit the Beat to a single CPU core via the `max_procs` setting.
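For reference, the CPU cap mentioned above is a one-line setting in the Beat's configuration file (shown here for Filebeat as an example; the setting exists across Beats):

```yaml
# filebeat.yml (example)
# Limit the Beat to a single CPU core; max_procs maps to Go's GOMAXPROCS.
max_procs: 1
```

This caps CPU usage only; it does not bound memory or network bandwidth.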
With systemd one can configure `ExecStartPre` and `ExecStopPost` scripts in the unit files. This allows users to install/remove rules as part of the service startup. Unfortunately systemd has removed the NetClass setting, requiring users to fall back to the `tc` tool. On Linux one can also make use of net_prio + cgroups (e.g. https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/resource_management_guide/sec-prioritizing_network_traffic), but AFAIK this is not easy to integrate with systemd, and it looks like cgroups v2 has not yet settled on the supported cgroup controllers.
Being able to configure the bandwidth outside the Beat also makes it easier for users to adapt the rate based on time or internal policies. The F5 docs (and other network vendors'), for example, describe how to do traffic shaping: https://techdocs.f5.com/en-us/bigip-15-1-0/big-ip-access-policy-manager-network-access/shaping-traffic-on-the-network-access-client.html
Deciding on an approach
There are multiple ways to set limits, each with pros and cons. In general, bandwidth limiting can be static or dynamic, or even a mix of both driven by a predefined schedule. One customer is asking for rate limiting in the application (Beats). This is easy to configure from the user's point of view, but what we can support is limited: do we limit based on the number of events, or on bytes? Currently, for both Beats and Logstash, the unit of work is the event. Until an event reaches the outputs, we cannot tell its actual size in bytes. Applying limits in the output would be possible to some extent, but is also limited: for some outputs we have no control over the network clients and their setup, and without control over the sockets we cannot limit byte usage at all.
Different outputs may need different limits. The Kafka client also does the batching itself, without giving us much control. Even with rate limiting in front of the Kafka client, our limits cannot be accurate, because batching leads to spikes/bursts, especially if the remote service was unavailable for some time. For the other outputs we can create our own connections and measure bandwidth usage, but even then the rate limiting would not take network protocol overhead into account.
Giving network packets dynamic priority has the advantage of a dynamic bound: give other applications higher priority, but if bandwidth is available right now, use it to ingest more data. However, this can only be decided accurately by the OS or the network, not by the Beat. Not having enough bandwidth available at all leads to data loss or, in the case of Filebeat, to Filebeat not closing file descriptors (because not all events have been published yet).
A long time ago we created a proposal for event scheduling in PR #7082. We dropped the proposal/PR because it would not have solved all possible requirements. Due to batching/buffering, bandwidth limiting must be applied explicitly in the outputs or at the network layer. This means we would have to implement bandwidth limiting per output, where the client library allows us to do so. We should either reconsider this approach or identify a better one to solve this issue.