
Add event rate quota per Cloud Foundry organization #21020

Closed
jsoriano opened this issue Sep 8, 2020 · 14 comments · Fixed by #23330
Labels: enhancement, Team:Platforms (Label for the Integrations - Platforms team), v7.11.0

Comments

@jsoriano
Member

jsoriano commented Sep 8, 2020

Describe the enhancement:

Add some kind of event rate quota per Cloud Foundry organization, to allow dropping events once some limit is reached.

There is an issue about adding throttling in general (#17775), but this may require many more changes.
In the case of Cloud Foundry we could add the rate limit per organization in the specific input.

Add a field or tag to the events to indicate that they are being throttled.

Describe a specific use case for the enhancement or feature:

On clusters with many organizations it may be good to limit the event rate so that organizations make fair use of resources.

@jsoriano added the enhancement and Team:Platforms labels Sep 8, 2020
@elasticmachine
Collaborator

Pinging @elastic/integrations-platforms (Team:Platforms)

@ycombinator
Contributor

ycombinator commented Nov 30, 2020

Discussed this with @jsoriano off-issue so wanted to summarize our discussion here:

  • There's no way to ask CF to restrict events by org when consuming them from the firehose.
  • Consequently, Beats would have to consume all the events from the firehose and then drop any events that violate the defined rate limiting policy.
  • The advantage of this approach is that we can implement this functionality as a generic rate limiting processor that can be used against any events; for example, we could use it to rate limit events based on kubernetes.namespace.
  • The disadvantage of this approach is that the rate limiting only helps with resource consumption in Beats and downstream components (ES, LS, etc.); it does not help with resource usage in CF itself.

@ycombinator
Contributor

ycombinator commented Nov 30, 2020

I've started to think a bit about what the configuration for such a processor might look like; here's an initial proposal (not married to any of the setting names, of course):

processors:
- rate_limiter:
    global: "500/m"          # optional, but either global or by_field or both must be specified
    by_field:                # optional, but either global or by_field or both must be specified
    - field: "foo.bar"       # required
      value: "56/s"          # optional, but either value or values or both must be specified
      values:                # optional, but either value or values or both must be specified
      - baz: "4500/h"        # required

The way the above example configuration would be interpreted is:

  • For events that have foo.bar == "baz", a rate limit of 4500 events per hour will be applied.
  • For all other events that have a foo.bar field present, a rate limit of 56 events per second will be applied.
  • For all other events, a rate limit of 500 events per minute will be applied.

This structure would allow complex rate limiting policies to be configured while keeping simple ones simple. For example, to enforce a rate limit of 500 events per second for events from the Cloud Foundry org acme, the configuration would look like:

processors:
- rate_limiter:
    by_field:
    - field: "cloudfoundry.org.name"
      values:
      - acme: "500/s"

I'm deliberately leaving aside the choice of rate limiting algorithm (fixed window, sliding window, token bucket, leaky bucket) for now. Just trying to focus on the configuration UX first.
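
For illustration only, here is a minimal token-bucket sketch in Go; it is just one of the candidate algorithms listed above, not a decision, and none of the names below come from Beats:

package main

import (
    "fmt"
    "time"
)

// tokenBucket is a toy rate limiter: Allow refills the bucket in
// proportion to the elapsed time and spends one token per event.
type tokenBucket struct {
    capacity float64   // maximum tokens (burst size)
    tokens   float64   // tokens currently available
    rate     float64   // tokens added per second, e.g. 500 for "500/s"
    last     time.Time // time of the last refill
}

func (b *tokenBucket) Allow(now time.Time) bool {
    b.tokens += now.Sub(b.last).Seconds() * b.rate
    if b.tokens > b.capacity {
        b.tokens = b.capacity
    }
    b.last = now
    if b.tokens >= 1 {
        b.tokens--
        return true // event passes
    }
    return false // event would be dropped
}

func main() {
    b := &tokenBucket{capacity: 500, tokens: 500, rate: 500, last: time.Now()}
    fmt.Println(b.Allow(time.Now())) // true while tokens remain
}

Per-field rate limiting would then just keep one such bucket per key, e.g. one per cloudfoundry.org.name value.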

WDYT @jsoriano?

@jsoriano
Member Author

@ycombinator thanks for your proposal, it looks quite good, but I have some observations; maybe we can make the processor simpler.

One is that, instead of having global and per-field configurations, we could rely on conditional processors for the more complex cases. There are already a couple of ways of defining conditional processors, so maybe we can use them for the complex configurations and simplify the processor itself.
Taking the simplicity of rate_limiter to the limit, imagine for example that it can only be configured with a field name and a limit. The examples in your comment could be defined as something like this:

processors:
- if.has_fields: ['foo.bar']
  then:
    - if.equals.foo.bar: baz
      then:
      - rate_limiter:
          field: "foo.bar"
          limit: "4500/h"
      else:
      - rate_limiter:
          limit: "56/s"
  else:
    - rate_limiter:
        limit: "500/m"
processors:
- rate_limiter:
    when.equals.cloudfoundry.org.name: "acme"
    limit: "500/s"

And the more usual configurations become simpler:

processors:
- rate_limit:
    field: "cloudfoundry.org.name"
    limit: "500/s"

The other point is partly a question: allowing multiple fields to be configured in the same processor may lead to unexpected results. For example, with the following configuration, if apps with the same name exist in multiple orgs, they can affect one another:

processors:
- rate_limiter:
    by_field:
    - field: "cloudfoundry.org.name"
      value: "500/s"
    - field: "cloudfoundry.app.name"
      value: "100/s"

I guess there can be multiple interpretations of this configuration, and the result may depend on the order of the definitions. How would this be interpreted? Would events discarded because of the first field be counted for the second?
We may also need to rate limit by multiple fields at once, so that, for example, we can rate limit apps that exist in multiple organizations. For that we could have something like this:

- rate_limiter:
    by_field:
    - fields:
      - "cloudfoundry.app.name"
      - "cloudfoundry.org.name"
      value: "100/s"

Or this:

- rate_limiter:
    by_field:
    - field_pattern: "%{{cloudfoundry.app.name}}_%{{cloudfoundry.org.name}}"
      value: "100/s"

One last thing is about the definition of the rate limits. If multiple values are in the same processor and they have different units, e.g. 4500/h and 56/s, they may require very different granularities for the buckets/windows. Maybe the unit could be a period in the config, so that it is the same for all the limits in the processor:

processors:
- rate_limiter:
    period: second
    global_limit: 8
    by_field:
    - field: "foo.bar"
      limit: 56
      values:
      - baz: 1.25

But this may be more of an implementation detail, and not so relevant if we simplify the processor.

@ycombinator
Contributor

ycombinator commented Dec 1, 2020

Yeah, I had also thought of using conditionals for complex conditions, but what I didn't like about it is that there is a lot of repetition (and therefore a chance of making errors) with the field name. If you take the first example with conditionals:

processors:
- if.has_fields: ['foo.bar']
  then:
    - if.equals.foo.bar: baz
      then:
      - rate_limiter:
          field: "foo.bar"
          limit: "4500/h"
      else:
      - rate_limiter:
          limit: "56/s"
  else:
    - rate_limiter:
        limit: "500/m"

The field foo.bar is repeated in three places.

OTOH, I do like the readability of using conditionals — I think it's much more obvious what rate limits will be applied in which cases. So on the whole, I'm +1 to going with the conditionals approach instead of my original proposed syntax.


Regarding the question about multiple fields causing ambiguity, I like this proposal:

- rate_limiter:
    by_field:
      fields:
      - "cloudfoundry.app.name"
      - "cloudfoundry.org.name"
      limit: "100/s"

It's similar to the field_pattern proposal but I like this one better because it's a bit easier for users to configure IMO. IIUC the purpose of the field_pattern is mainly to build a key for the rate limit tracking. If so, I think we can build this key internally instead of asking the user to supply it via field_pattern. Or are there other use cases you were thinking of unlocking with the field_pattern idea that I'm not thinking of?
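
To make the "build the key internally" idea concrete, here is a rough Go sketch that joins the configured field values into a single tracking key; the event representation and the example values are assumptions, not Beats code:

package main

import (
    "fmt"
    "strings"
)

// rateLimitKey builds one tracking key per distinct combination of the
// configured fields by joining their values in the configured order.
func rateLimitKey(event map[string]interface{}, fields []string) string {
    parts := make([]string, 0, len(fields))
    for _, f := range fields {
        if v, ok := event[f]; ok {
            parts = append(parts, fmt.Sprintf("%v", v))
        } else {
            parts = append(parts, "") // a missing field still occupies a slot
        }
    }
    return strings.Join(parts, "_")
}

func main() {
    event := map[string]interface{}{
        "cloudfoundry.app.name": "billing", // hypothetical example values
        "cloudfoundry.org.name": "acme",
    }
    // One bucket per app/org combination, here "billing_acme".
    fmt.Println(rateLimitKey(event, []string{"cloudfoundry.app.name", "cloudfoundry.org.name"}))
}

An event with the same app name but a different org would get its own key, which is exactly the behavior the fields proposal is after.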


Regarding the units, I think we should try to keep it as easy for users as possible, which would mean allowing them to define different units for multiple values in the same processor. I think some of this will come down to the implementation of the rate limiting algorithm. If it proves to be too difficult then we can go with your suggestion of asking the user to define a common unit for the entire processor definition. But initially I think it would be nice if we could provide more flexibility to the users and take on the resulting complexity on ourselves in the implementation.
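
To show that mixed units don't necessarily complicate things much, here is an illustrative Go sketch that normalizes a "count/unit" limit string to events per second (the format and names are assumptions, not the final implementation):

package main

import (
    "fmt"
    "strconv"
    "strings"
)

// perSecond parses limits like "4500/h", "56/s" or "500/m" and converts
// them to a common events-per-second rate.
func perSecond(limit string) (float64, error) {
    parts := strings.SplitN(limit, "/", 2)
    if len(parts) != 2 {
        return 0, fmt.Errorf("invalid limit %q", limit)
    }
    count, err := strconv.ParseFloat(strings.TrimSpace(parts[0]), 64)
    if err != nil {
        return 0, err
    }
    seconds := map[string]float64{"s": 1, "m": 60, "h": 3600}
    unit, ok := seconds[strings.TrimSpace(parts[1])]
    if !ok {
        return 0, fmt.Errorf("unknown unit in %q", limit)
    }
    return count / unit, nil
}

func main() {
    for _, l := range []string{"4500/h", "56/s", "500/m"} {
        r, _ := perSecond(l)
        fmt.Printf("%s = %.3f events/second\n", l, r)
    }
}

For example, 4500/h comes out as 1.25 events per second, the same figure as in the period-based example above.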

@jsoriano
Member Author

jsoriano commented Dec 1, 2020

Regarding the question about multiple fields causing ambiguity, I like this proposal:

- rate_limiter:
    by_field:
      fields:
      - "cloudfoundry.app.name"
      - "cloudfoundry.org.name"
      limit: "100/s"

Reviewing my suggestion, I think one of the by_field/fields levels is not needed, so it could be like this:

- rate_limiter:
    fields:
    - "cloudfoundry.app.name"
    - "cloudfoundry.org.name"
    limit: "100/s"

Or do you expect to have more rate limiting methods apart from by_field?

I think we can build this key internally instead of asking the user to supply it via field_pattern. Or are there other use cases you were thinking of unlocking with the field_pattern idea that I'm not thinking of?

Agreed, it's better to build this key internally; it will be less error-prone. I cannot think of any legit use case that would work with field_pattern and not with fields.

Regarding the units, I think we should try to keep it as easy for users as possible, which would mean allowing them to define different units for multiple values in the same processor. I think some of this will come down to the implementation of the rate limiting algorithm. If it proves to be too difficult then we can go with your suggestion of asking the user to define a common unit for the entire processor definition. But initially I think it would be nice if we could provide more flexibility to the users and take on the resulting complexity on ourselves in the implementation.

Agreed, let's keep them as you proposed for now; we can reconsider this during implementation. In any case, I would only see it as a potential problem if we had multiple limits in the same processor, which we don't have in the simplified one.

@ycombinator
Contributor

Or do you expect to have more rate limiting methods apart of by_fields?

No, let's simplify as you suggested and drop the extra level of configuration.

@ycombinator
Contributor

BTW, I just had my 1:1 with @exekias and he suggested naming this processor something like sample for two reasons:

  1. All our processors are named like actions, i.e. they are verbs, e.g. add_metadata, and
  2. Rate limiting with dropping events is basically sampling. Plus, you had also suggested off-issue that we can introduce a percentage-based sampling strategy to this processor in the future, à la the Logstash drop filter plugin.

@jsoriano
Member Author

jsoriano commented Dec 1, 2020

Rate limiting with dropping events is basically sampling.

I slightly disagree 🙂 They do basically the same thing, dropping a part of the total events, but they have different use cases and expectations.
Rate limiting is related to the resources you have available to monitor something: if that something exceeds the limits, it cannot be properly monitored, and you are protecting the rest of the system so it is not affected.
Sampling is an optimization you can apply to reduce resource usage when you know that collecting a reduced quantity of data is going to give you similar information.
When you drop events because of a rate limit, you are losing information; when you drop events because of sampling, you do it in a controlled way, losing accuracy at most.

Also, we could decide to support different actions in the future. Apart from a default action: drop, we could have action: write_to_local_file or action: wait. These actions wouldn't make much sense for sampling, but they can make sense for a rate limit.

In any case, we could decide to implement sampling in this processor, and if we implement both things in the same processor then I am OK with calling it sample. If not, I think we can call it rate_limit to make it an action, and call the one that does sampling sample when/if we implement it.

Take into account that the parameters required for sampling and rate limiting are different. For example, if I want to get 1% of the metrics, but no more than 100/s, I would need a definition like this one:

- sample:
    fields:
    - "cloudfoundry.app.name"
    - "cloudfoundry.org.name"   
    sampling: 0.01
    limit: "100/s"

Or it could be done like this with more specific processors:

- sample:
    when.has_fields: ['cloudfoundry.app.name', 'cloudfoundry.org.name']
    sampling: 0.01
- rate_limit:
    fields:
    - "cloudfoundry.app.name"
    - "cloudfoundry.org.name"   
    limit: "100/s"

@jsoriano
Member Author

jsoriano commented Dec 1, 2020

Continuing from my previous comment about doing sampling and rate limiting in the same processor: I now think that they cover different use cases and probably require two completely different implementations (sampling is simpler than rate limiting), so in my opinion they should be two different processors.

@ycombinator
Contributor

Thanks @jsoriano for your thoughtful comments on rate limiting vs. sampling. @exekias and I discussed some of these points too (off issue). In the end, I think there are enough differences, in semantics but also in options that might only make sense for either rate limiting or sampling, that we should make two separate processors as well.

Just to get things moving for now, I'm going to start working on a rate_limit processor PR. If we decide to include sampling in the same processor, we can always make changes before the PR is merged.

@exekias
Contributor

exekias commented Dec 2, 2020

Sounds good folks, sorry for the noise. My comment was around the fact that rate limiting doesn't necessarily imply dropping data, whereas sampling does. I think rate_limit as a name is good enough, as long as the expectations are correctly documented.

@ycombinator
Contributor

ycombinator commented Dec 2, 2020

The following are mostly my own notes for writing tests and docs, but I'm posting them here in case anyone sees any issues:

Use case: rate limit all events to 10000 /m

Configuration:

processors:
- rate_limit:
    limit: "10000/m"

Use case: rate limit events from the acme Cloud Foundry org to 500 /s

Configurations (each of these are alternatives to one another):

processors:
- rate_limit:
    when.equals.cloudfoundry.org.name: "acme"
    limit: "500/s"
processors:
- if.equals.cloudfoundry.org.name: "acme"
  then:
  - rate_limit:
      limit: "500/s"

Use case: rate limit events from the acme Cloud Foundry org and roadrunner space to 1000 /h

Configurations (each of these are alternatives to one another):

processors:
- rate_limit:
    when.and:
    - equals.cloudfoundry.org.name: "acme"
    - equals.cloudfoundry.space.name: "roadrunner"
    limit: "1000/h"
processors:
- if.and:
  - equals.cloudfoundry.org.name: "acme"
  - equals.cloudfoundry.space.name: "roadrunner"
  then:
  - rate_limit:
      limit: "1000/h"

Use case: rate limit events for each distinct Cloud Foundry org to 400 /s

processors:
- rate_limit:
    fields:
    - "cloudfoundry.org.name"
    limit: "400/s"

Use case: rate limit events for each distinct Cloud Foundry org and space combination to 20000 /h

processors:
- rate_limit:
    fields:
    - "cloudfoundry.org.name"
    - "cloudfoundry.space.name"
    limit: "20000/h"

This is a bit of a contrived use case as we could probably just use a single field, cloudfoundry.space.id, in the configuration since that has globally unique values across all orgs, but I wanted to demonstrate the idea of a field combination.

@ycombinator mentioned this issue Dec 3, 2020
@ycombinator
Contributor

The PR for a basic rate_limit processor is up here: https://github.com/elastic/infra/issues/25378.

Additionally, we will need a follow-up PR for this requirement:

Add a field or tag to the events to indicate that they are being throttled.

Some quick thoughts about this requirement after discussing it with @jsoriano off-issue:

  • As rate-limited events are dropped by the rate_limit processor, the field or tag will need to go on the next event that is allowed through (see the rough sketch below).
  • The field or tag name (and maybe even the value?) should probably be configurable via an optional setting on the rate_limit processor.
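
A very rough Go sketch of that first point, with all names (including the rate_limited tag) being hypothetical placeholders rather than actual settings:

package main

import "fmt"

// limiter drops events while the allow policy says no, and tags the
// first event that is let through after a throttled period.
type limiter struct {
    allow        func() bool // stand-in for the real rate limiting check
    droppedSince int         // events dropped since the last allowed one
}

// Run returns nil when the event is dropped, or the (possibly annotated)
// event when it is allowed through.
func (l *limiter) Run(event map[string]interface{}) map[string]interface{} {
    if !l.allow() {
        l.droppedSince++
        return nil
    }
    if l.droppedSince > 0 {
        tags, _ := event["tags"].([]interface{})
        event["tags"] = append(tags, "rate_limited") // hypothetical tag name
        l.droppedSince = 0
    }
    return event
}

func main() {
    n := 0
    l := &limiter{allow: func() bool { n++; return n%2 == 1 }} // toy policy: drop every other event
    for i := 0; i < 4; i++ {
        fmt.Println(l.Run(map[string]interface{}{"message": i}))
    }
}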
