Throttling Beats for system stability #17775

Open
mostlyjason opened this issue Apr 16, 2020 · 12 comments
Labels
discuss (Issue needs further discussion), libbeat, Team:Integrations (Label for the Integrations team), [zube]: Team Triage

Comments

@mostlyjason

mostlyjason commented Apr 16, 2020

Describe the enhancement:
Users have filed several issues requesting the ability to throttle Beats. This is usually requested in order to improve system stability and reduce the impact on other applications. Since Beats are monitoring applications, they should not interrupt critical business applications. We'd like to evaluate all these requests and determine the best plan for implementation.

Describe a specific use case for the enhancement or feature:
There are several types of resources that users are concerned about:

There are several ways to mitigate these issues

I'm listing them here together because a limit on one may indirectly impose a limit on the others. Thus it may be possible to solve many (but perhaps not all) of these problems with a single solution. There are different ways to implement each of these limits, and pros and cons to each one. We should evaluate each to determine the best solution that will help the most customers.

Why system tools fall short
Historically we have preferred to rely on system tools for rate limiting/QoS because they give operators more control. However, these tools are not accessible to all users, they may be difficult to set up or configure, and operators may need to implement a variety of solutions across heterogeneous systems. Providing even a simple limit is better than nothing for users who want an out-of-the-box solution.

Currently, our docs give an example of how to configure limits using tc and iptables. See:
https://www.elastic.co/guide/en/beats/filebeat/current/bandwidth-throttling.html#bandwidth-throttling. Another option that works today is to limit a Beat to a single CPU core via the max_procs setting.

With systemd one can configure ExecStartPre and ExecStopPost scripts in
the systemd unit files. This allows users to install/remove rules as part of the
service startup. Unfortunately systemd has removed the NetClass setting, requiring users to use the tc tool. On Linux one can also make use of net_prio + cgroups (e.g.
https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/resource_management_guide/sec-prioritizing_network_traffic), but AFAIK this is not easy to integrate with systemd, and it looks like cgroups v2 has not really settled on which cgroup controllers it supports.

Being able to configure the bandwidth outside the Beat also makes it easier for
users to adapt the rate based on time of day or internal policies. F5 (like other network vendors), for example, provides docs on how to do traffic shaping: https://techdocs.f5.com/en-us/bigip-15-1-0/big-ip-access-policy-manager-network-access/shaping-traffic-on-the-network-access-client.html

Deciding on an approach
There are multiple ways to set limits, and there are pros and cons to each approach. In general, bandwidth limits can be static or dynamic, or even a mix of the two that follows a predefined schedule. One customer is asking for rate limiting in the application (Beats). This is
easy to configure from the user's point of view, but what we can support is limited, starting with the question of whether to limit by number of events or by bytes. Currently, for both Beats and Logstash, the unit of work is the event. Until the event hits the outputs, we cannot tell the actual byte usage. Applying the limit in the output would be possible to some extent, but is also limited. For some outputs we have no control over network clients and setup. Without any control over sockets, we cannot limit by byte usage at all.

Different outputs may need different limits. The Kafka client also does the batching itself, without us being able to exert much control at all. Even with rate limiting before the Kafka client, our limits cannot be accurate, because batching leads to spikes/bursts, especially if the remote service was unavailable for some time. For the other outputs we can create our own connections and measure bandwidth usage, but the rate limiting still would not be able to take network protocol overhead into account.

Giving network packets dynamic priority has the advantage of providing a dynamic bound: give other applications a higher priority, but if bandwidth is available right now to ingest more data, then use it. To be accurate, though, this can only be decided by the OS or the network, not by the Beat. Not having enough bandwidth available at all leads to data loss or, in the case of Filebeat, to Filebeat not closing file descriptors (because not all events have been published yet).

A long time ago we created a proposal for event scheduling with PR #7082. We dropped the proposal/PR because it would not have solved all possible requirements. Due to batching/buffering, bandwidth limits must be applied explicitly in the outputs or in the network layer. This means that we will have to implement support for limiting bandwidth per output, if the client library used allows us to do so. We should either reconsider this approach or identify a better one to solve this issue.

mostlyjason added the discuss, libbeat, and Team:Integrations labels on Apr 16, 2020
@elasticmachine
Collaborator

Pinging @elastic/integrations (Team:Integrations)

@zez3

zez3 commented May 5, 2020

Has anyone seen this token bucket Go implementation?
https://github.com/ozonru/filebeat-throttle-plugin

@jsoriano
Member

Another use case: in Cloud Foundry you may want to set an event rate limit per organization, or in Kubernetes per namespace.

@jsoriano
Member

Beats 7.11 will include a rate_limit processor for rate limiting: #22883

@zez3

zez3 commented Dec 15, 2020

"Events that exceed the rate limit are dropped"

Good addition, but throttling would be even better

@ycombinator
Contributor

@zez3 as I replied to your comment in #22883 (comment):

Thanks for the feedback, @zez3. For now, we are starting with a rudimentary implementation that'll drop events that exceed the rate limit. In the future we may add other strategies like the ones you suggest, either as options to this processor or as separate processors.

@PhaedrusTheGreek
Contributor

Another component of this discussion might be event size, as it falls under the category of ingestion controls for pipeline stability. For example, in one instance an organization was successful by restricting event sizes to 500kb at Kafka.
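A size cap like this could be sketched as a simple filter in Go. filterBySize and maxEventBytes are hypothetical names, and reading "500kb" as 500,000 bytes is an assumption; the comment does not say whether decimal or binary units were meant, or whether oversized events were dropped or truncated.

```go
package main

import "fmt"

// maxEventBytes mirrors the 500kb figure from the comment above. Decimal
// kilobytes is an assumption made for this sketch.
const maxEventBytes = 500 * 1000

// filterBySize returns the events whose size is at most limit bytes, plus a
// count of the events that were dropped for being too large.
func filterBySize(events [][]byte, limit int) ([][]byte, int) {
	kept := make([][]byte, 0, len(events))
	dropped := 0
	for _, e := range events {
		if len(e) > limit {
			dropped++
			continue
		}
		kept = append(kept, e)
	}
	return kept, dropped
}

func main() {
	events := [][]byte{
		make([]byte, 1024),            // small event, kept
		make([]byte, maxEventBytes+1), // one byte over the cap, dropped
		make([]byte, maxEventBytes),   // exactly at the cap, kept
	}
	kept, dropped := filterBySize(events, maxEventBytes)
	fmt.Println(len(kept), dropped)
}
```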

@zez3

zez3 commented Feb 18, 2021

The future is now. We will most probably move to the Elastic Stack this year, and we kind of need this throttling implemented.
How can I, as a potential future client or current client, influence the development of this feature?

@cakarlen

Any updates to this issue? I would add that the ability to throttle certain Fleet integrations would be handy as some integrations are more resource-intensive than others. Maybe this would apply to a Fleet agent policy more so than an individual Fleet integration?

@zez3

zez3 commented Jul 11, 2024

"Any updates to this issue? I would add that the ability to throttle certain Fleet integrations would be handy as some integrations are more resource-intensive than others. Maybe this would apply to a Fleet agent policy more so than an individual Fleet integration?"

@cakarlen Please see the work done in #35615

@zez3

zez3 commented Jul 11, 2024

Also follow elastic/elastic-agent-shipper#16

@zez3

zez3 commented Jul 11, 2024

I would close this issue now that the shipper is almost functional @jsoriano


8 participants