
[Rate limit processor] Add counter metric for dropped events #23330

Merged: 9 commits merged into elastic:master on Jan 13, 2021

Conversation

@ycombinator (Contributor) commented Dec 30, 2020

What does this PR do?

This PR adds a monitoring counter metric, `processors.rate_limit.n.dropped`, where `n` is the nth instance (1-based) of a rate_limit processor used in a Beat's configuration. The counter is incremented each time the processor drops an event due to rate limiting.
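
For example, with two rate_limit processors in one configuration (a hypothetical config; the `fields` value is illustrative), the counters would be `processors.rate_limit.1.dropped` and `processors.rate_limit.2.dropped`:

    processors:
      - rate_limit:          # -> processors.rate_limit.1.dropped
          limit: "1/s"
      - rate_limit:          # -> processors.rate_limit.2.dropped
          fields: ["cloudfoundry.org.id"]
          limit: "100/m"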

Why is it important?

This allows users of the processor to understand whether their events are being rate limited and by how much.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

How to test this PR locally

  1. Create a minimal Filebeat configuration with this processor in it.

    # filebeat.test.yml
    
    filebeat.inputs:
    - type: stdin
    
    processors:
      - rate_limit:
          limit: "1/m"
    
    output.console:
      enabled: true
      pretty: true
    
    http.enabled: true
    
  2. Run Filebeat with the above configuration.

    filebeat -c filebeat.test.yml
    
  3. Send events to Filebeat via STDIN at a rate faster than one event per minute (the rate limit).
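
     For example, steps 2 and 3 can be combined by piping generated events into Filebeat's STDIN (one possible approach; any method of writing lines faster than the limit works):

    # Emit one event per second -- well above the 1/m rate limit.
    while true; do echo "test event $(date +%s)"; sleep 1; done | filebeat -c filebeat.test.yml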

  4. In another window, check that the Filebeat Stats API exposes the monitoring counter implemented by this PR and that it increments as expected.

    curl -s 'http://localhost:5066/stats' | jq -c '.processor.rate_limit."1".dropped' 
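
     Illustrative output from two successive runs while events are flowing (exact counts will vary):

    $ curl -s 'http://localhost:5066/stats' | jq -c '.processor.rate_limit."1".dropped'
    3
    $ curl -s 'http://localhost:5066/stats' | jq -c '.processor.rate_limit."1".dropped'
    7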
    

Related issues

@botelastic (bot) added the needs_team label (indicates that the issue/PR needs a Team:* label) on Dec 30, 2020
@ycombinator added the needs_backport (PR is waiting to be backported to other branches), Team:Platforms (label for the Integrations - Platforms team), v7.12.0, and v8.0.0 labels on Dec 30, 2020
@elasticmachine (Collaborator) commented:

Pinging @elastic/integrations (Team:Platforms)

@botelastic (bot) removed the needs_team label on Dec 30, 2020
@@ -9,6 +9,9 @@ beta[]
The `rate_limit` processor limits the throughput of events based on
the specified configuration.

In the current implementation, rate-limited events are dropped. Future implementations may allow rate-limited events to
be handled differently.

@ycombinator (Contributor, Author) commented:

This change is unrelated to this PR, but since it's minor I thought I would include it here. Let me know if you'd prefer to have it in its own PR.

@jsoriano (Member) commented:

I am OK with adding it here, but maybe only the "In the current implementation, rate-limited events are dropped." part; I don't think we need to document possible future implementations 🙂

@elasticmachine (Collaborator) commented Dec 30, 2020:

💚 Build Succeeded


Build stats

  • Build Cause: Pull request #23330 updated
  • Start Time: 2021-01-13T14:45:56.337+0000
  • Duration: 45 min 34 sec
  • Commit: c370761

Test stats 🧪

  • Failed: 0
  • Passed: 17370
  • Skipped: 1345
  • Total: 18715

💚 Flaky test report

Tests succeeded.


    func (p *rateLimit) tagEvent(event *beat.Event) {
        if p.config.MetricField != "" && p.numRateLimited.Load() > 0 {
            event.PutValue(p.config.MetricField, p.numRateLimited.Load())
            p.numRateLimited.Store(0)

@ycombinator (Contributor, Author) commented:

I am on the fence about whether to reset this counter each time or let it grow so it's a true counter. I decided not to go with the latter since it would be more susceptible to wraparound issues. But happy to discuss and change!

@jsoriano (Member) commented:

If we include it in the event, I agree with resetting it after it is reported.

@@ -40,7 +41,10 @@ const processorName = "rate_limit"

    type rateLimit struct {
        config    config
        algorithm algorithm
        logger    *logp.Logger

        numRateLimited atomic.Uint64

@ycombinator (Contributor, Author) commented:

I decided to keep track of the number of events that have been rate-limited (since the last time an event was allowed through; more on that in https://github.com/elastic/beats/pull/23330/files#r550323522). Alternatively we could simply have a boolean to indicate that rate limiting has happened. But I felt like a number might be more informative. Happy to discuss and change, though!

@ycombinator (Contributor, Author) commented:

On a related note, it would be nice if we could include this metric as part of the libbeat monitoring infrastructure. For example, the dns processor is doing something like this. If you agree, I can implement this as part of this PR or a separate PR.

@jsoriano (Member) commented:

Agree with current approach.

As an alternative I was thinking of setting only a tag in the event, and having a metric in the monitoring infra to report the number of rate-limited events, but this can be problematic if multiple rate_limit processors are used.

@jsoriano (Member) left a review:

I have added some thoughts, let me know what you think. Thanks for adding this!


            return event, nil
        }

        p.numRateLimited.Inc()

@jsoriano (Member) commented:

Nit: don't keep track of this if `p.config.MetricField == ""`; the atomic operation could add some contention.

@@ -38,3 +41,4 @@ The following settings are supported:

`limit`:: The rate limit. Supported time units for the rate are `s` (per second), `m` (per minute), and `h` (per hour).
`fields`:: (Optional) List of fields. The rate limit will be applied to each distinct value derived by combining the values of these fields.
`metric_field`:: (Optional) When events are rate limited, the first event that is allowed to pass through after rate limiting will contain this field. The value of the field will be the number of events that were rate limited since the last time an event was allowed to pass through.
@jsoriano (Member) commented:

I wonder if we should decide what field to use for this ourselves. That way users wouldn't need to think about what field to use, with the risk of choosing a field that is already used by something else. We could also provide a mapping for the field, and we could keep this always enabled.

(Naming is complicated, I don't have a suggestion for this field 😬 )

As an example of something somewhat similar, the multiline feature of Filebeat adds a multiline flag in a known field (log.flags) when multiple lines are merged into a single event, and this cannot be disabled.

@ycombinator (Contributor, Author) commented:

I recall having a conversation previously where we thought it might be a good idea to give the user control of the field: #21020 (comment). But your points here about the advantages of having a fixed field are also valid.

Maybe we should take a step back and ask what the use case is for including a field in the event that describes a past state, especially considering that events may be batched and retried by some outputs. I think the idea is to know that rate limiting is happening or not and, if it is, to what extent. So maybe a monitoring counter is the more reliable way of expressing this?

@jsoriano (Member) commented:

> I think the idea is to know that rate limiting is happening or not and, if it is, to what extent.

Yes, I think the same.

> So maybe a monitoring counter is the more reliable way of expressing this?

A monitoring counter sounds good, but I wonder if it is enough when multiple rate_limit processors are configured in the same Beat; it can be difficult to see which one is being triggered. Though maybe this is not such a problem in real deployments.
Having a counter for each rate_limit processor could be an alternative, but it can complicate things, and I am not sure how each counter would be correlated with each config.
Having the info in the event is not as nice or reliable, but I guess it makes it easier to identify what kind of events are being throttled.

Maybe for now we can go with a global monitoring counter (one for all rate_limit processors) and wait for feedback to see if this is enough. Having debug logging about the throttled events could also help with tricky cases and complicated configs.

@ycombinator (Contributor, Author) commented:

> A monitoring counter sounds good, but I wonder if it is enough when multiple rate_limit processors are configured in the same Beat; it can be difficult to see which one is being triggered. Though maybe this is not such a problem in real deployments.

I looked at how the dns processor solves this issue. It does so by giving each instance of the processor its own ID and then using it for logging and metrics monitoring purposes:

    // Logging and metrics (each processor instance has a unique ID).
    var (
        id      = int(instanceID.Inc())
        log     = logp.NewLogger(logName).With("instance_id", id)
        metrics = monitoring.Default.NewRegistry(logName+"."+strconv.Itoa(id), monitoring.DoNotReport)
    )

So maybe we could play the same game here?
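
For illustration, a minimal sketch of what "the same game" could look like in the rate_limit processor (the registry path follows the PR description; the helper name and exact wiring are assumptions, not necessarily the final code):

    // Sketch only. Assumed imports: "strconv",
    // "github.com/elastic/beats/v7/libbeat/common/atomic",
    // "github.com/elastic/beats/v7/libbeat/monitoring".

    // Each rate_limit instance gets a unique, 1-based ID; its counter then
    // lives at processors.rate_limit.<id>.dropped in the monitoring registry.
    var instanceID atomic.Uint32

    func newDroppedCounter() *monitoring.Uint {
        id := int(instanceID.Inc())
        reg := monitoring.Default.NewRegistry("processors.rate_limit." + strconv.Itoa(id))
        return monitoring.NewUint(reg, "dropped") // call Inc() on each dropped event
    }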

Alternatively we could create a new processors.MonitoredProcessor struct that implements processors.Processor but also implements a MonitoringRegistry() method that provides a monitoring registry instance for any processor that wishes to be monitored. This way we will have a consistent namespace for each processor instance within the libbeat monitoring registry for per-processor-instance metrics.
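
A rough sketch of that hypothetical wrapper (purely illustrative; no such type exists in libbeat today):

    // Hypothetical: wraps any processor and carries a per-instance
    // monitoring registry for its metrics.
    type MonitoredProcessor struct {
        processors.Processor
        registry *monitoring.Registry
    }

    func (m MonitoredProcessor) MonitoringRegistry() *monitoring.Registry {
        return m.registry
    }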

WDYT?

> Having debug logging about the throttled events could also help with tricky cases and complicated configs.

Agreed. We already have this today:

    p.logger.Debugf("event [%v] dropped by rate_limit processor", event)

@jsoriano (Member) commented:

OK, if we already have other processors doing something like this, I am fine with following the same approach 👍

@ycombinator (Contributor, Author) commented:

Cool, I'm going to change this PR to remove the metric field on the event and instead set up a monitoring counter for the cumulative total number of rate-limited events.

Comment on lines 130 to 132:

    if p.config.MetricField != "" && p.numRateLimited.Load() > 0 {
        event.PutValue(p.config.MetricField, p.numRateLimited.Load())
        p.numRateLimited.Store(0)

@jsoriano (Member) commented Jan 4, 2021:

If multiple goroutines are sending events at the same time, all of them could see that `p.numRateLimited.Load() > 0`, and they would create multiple events with the same count before it is reset to 0. We could use an atomic swap to ensure that the counter is only set in one event.

Suggested change:

    -if p.config.MetricField != "" && p.numRateLimited.Load() > 0 {
    -    event.PutValue(p.config.MetricField, p.numRateLimited.Load())
    -    p.numRateLimited.Store(0)
    +if p.config.MetricField != "" {
    +    if count := p.numRateLimited.Swap(0); count > 0 {
    +        event.PutValue(p.config.MetricField, count)

@ycombinator changed the title from "[Rate limit processor] Add metric_field" to "[Rate limit processor] Add counter metric for dropped events" on Jan 12, 2021
@ycombinator (Contributor, Author) commented:

@jsoriano I've updated this PR significantly per #23330 (comment). Please re-review when you get a chance. Thanks!

@ycombinator added the test-plan (add this PR to the manual test plan) and review labels on Jan 12, 2021
@jsoriano (Member) left a review:

Looks good, added some minor comments.

Review threads on libbeat/processors/ratelimit/rate_limit.go and CHANGELOG.next.asciidoc (outdated) were resolved.
@ycombinator merged commit 6d6da71 into elastic:master on Jan 13, 2021
@ycombinator removed the needs_backport label (PR is waiting to be backported to other branches) on Jan 13, 2021
ycombinator added a commit that referenced this pull request Jan 14, 2021
…23330) (#23493)

* [Rate limit processor] Add counter metric for dropped events (#23330)

* Adding throtted_field

* Documenting the field

* Adding note on dropping of events

* Renaming metric field

* Adding CHANGELOG entry

* Converting to monitoring counter metric

* Removing metric_field

* Fixing wrapping

* Removing old entry from CHANGELOG

Co-authored-by: Jaime Soriano Pastor <jaime.soriano@elastic.co>

Co-authored-by: Jaime Soriano Pastor <jaime.soriano@elastic.co>

* Removing extra empty lines from CHANGELOG

Co-authored-by: Jaime Soriano Pastor <jaime.soriano@elastic.co>
@andresrc added the test-plan-added label (this PR has been added to the test plan) on Feb 15, 2021
Labels
review · Team:Platforms (label for the Integrations - Platforms team) · test-plan (add this PR to the manual test plan) · test-plan-added (this PR has been added to the test plan) · v7.12.0 · v8.0.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add event rate quota per Cloud Foundry organization
4 participants