[Rate limit processor] Add counter metric for dropped events #23330
Conversation
Pinging @elastic/integrations (Team:Platforms)
@@ -9,6 +9,9 @@ beta[]
The `rate_limit` processor limits the throughput of events based on
the specified configuration.

In the current implementation, rate-limited events are dropped. Future implementations may allow rate-limited events to
be handled differently.
This change is unrelated to this PR, but since it's minor I thought I would include it here. Let me know if you'd prefer to have it in its own PR.
I am ok with adding it here, but maybe only the part "In the current implementation, rate-limited events are dropped." I don't think it is needed to document possible future implementations 🙂
💚 Build Succeeded
💚 Flaky test report: Tests succeeded.
func (p *rateLimit) tagEvent(event *beat.Event) {
	if p.config.MetricField != "" && p.numRateLimited.Load() > 0 {
		event.PutValue(p.config.MetricField, p.numRateLimited.Load())
		p.numRateLimited.Store(0)
I am on the fence about whether to reset this counter each time or let it grow so it's a true counter. I decided not to go with the latter since it would be more susceptible to wraparound issues. But happy to discuss and change!
If we include it in the event, I agree with resetting it after it is reported.
@@ -40,7 +41,10 @@ const processorName = "rate_limit"
type rateLimit struct {
	config    config
	algorithm algorithm
	logger    *logp.Logger

	numRateLimited atomic.Uint64
I decided to keep track of the number of events that have been rate-limited (since the last time an event was allowed through; more on that in https://github.com/elastic/beats/pull/23330/files#r550323522). Alternatively we could simply have a boolean to indicate that rate limiting has happened. But I felt like a number might be more informative. Happy to discuss and change, though!
On a related note, it would be nice if we could include this metric as part of the libbeat monitoring infrastructure. For example, the `dns` processor is doing something like this. If you agree, I can implement this as part of this PR or a separate PR.
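For concreteness, here is a rough sketch of what that wiring might look like, loosely following the `dns` processor's pattern. The registry name (`processors.rate_limit.<id>`) and metric name (`dropped`) are illustrative assumptions, not the final implementation:

```go
package ratelimit

import (
	"strconv"

	"github.com/elastic/beats/v7/libbeat/common/atomic"
	"github.com/elastic/beats/v7/libbeat/monitoring"
)

// Each processor instance gets a unique ID so that its metrics live in
// their own namespace within the libbeat monitoring registry.
var instanceID atomic.Uint32

type processorMetrics struct {
	dropped *monitoring.Int // events dropped due to rate limiting
}

func newProcessorMetrics() processorMetrics {
	id := int(instanceID.Inc())
	reg := monitoring.Default.NewRegistry(
		"processors.rate_limit."+strconv.Itoa(id), monitoring.DoNotReport)
	return processorMetrics{dropped: monitoring.NewInt(reg, "dropped")}
}
```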
Agree with current approach.
As an alternative I was thinking of setting only a `tag` in the event, and having a metric in the monitoring infra to report the number of rate-limited events, but this can be problematic if multiple `rate_limit` processors are used.
I have added some thoughts, let me know what you think. Thanks for adding this!
		return event, nil
	}

	p.numRateLimited.Inc()
Nit: don't keep track of this if `p.config.MetricField == ""`; the atomic operation could add some contention.
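A minimal sketch of the guarded increment this nit suggests, assuming the existing `numRateLimited` field; only the early guard is new:

```go
// Skip the atomic counter entirely when no metric field is configured,
// avoiding needless cache-line contention on the hot path.
if p.config.MetricField != "" {
	p.numRateLimited.Inc()
}
```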
@@ -38,3 +41,4 @@ The following settings are supported:

`limit`:: The rate limit. Supported time units for the rate are `s` (per second), `m` (per minute), and `h` (per hour).
`fields`:: (Optional) List of fields. The rate limit will be applied to each distinct value derived by combining the values of these fields.
`metric_field`:: (Optional) When events are rate limited, the first event that is allowed to pass through after rate limiting will contain this field. The value of the field will be the number of events that were rate limited since the last time an event was allowed to pass through.
I wonder if we should decide what field to use for this. This way users wouldn't need to think about what field to use, with the risk of choosing a field that is used by something else. We could also provide a mapping for the field. And we could also keep this always enabled.
(Naming is complicated, I don't have a suggestion for this field 😬)
As an example of something somewhat similar, the `multiline` feature of Filebeat adds a `multiline` flag in a known field (`log.flags`) when multiple lines are merged into a single event, and this cannot be disabled.
I recall having a conversation previously where we thought it might be a good idea to give the user control of the field: #21020 (comment). But your points here about the advantages of having a fixed field are also valid.
Maybe we should take a step back and ask what is the use case for including a field in the event that describes a past state, especially considering that events may be batched and retried by some outputs. I think the idea is to know that rate limiting is happening or not and, if it is, to what extent. So maybe a monitoring counter is the more reliable way of expressing this?
> I think the idea is to know that rate limiting is happening or not and, if it is, to what extent.

Yes, I think the same.

> So maybe a monitoring counter is the more reliable way of expressing this?

A monitoring counter sounds good, but I wonder if this is enough when multiple `rate_limit` processors are configured in the same Beat; it can be difficult to see which one is being triggered. Though maybe this is not such a problem in real deployments.
Having a counter for each `rate_limit` processor could be an alternative, but it can complicate things, and I am not sure how each counter could be correlated to each config.
Having the info in the event is not so nice or reliable, but I guess it makes it easier to identify what kind of events are being throttled.
Maybe for now we can go with a global monitoring counter (one for all `rate_limit` processors), and wait for feedback to see if this is enough. Having debug logging about the throttled events could also help with tricky cases and complicated configs.
> A monitoring counter sounds good, but I wonder if this is enough when multiple rate_limit processors are configured in the same Beat; it can be difficult to see which one is being triggered. Though maybe this is not such a problem in real deployments.

I looked at how the `dns` processor solves this issue. It does so by giving each instance of the processor its own ID and then using it for logging and metrics monitoring purposes:

beats/libbeat/processors/dns/dns.go, lines 60 to 65 at 11c5367:
// Logging and metrics (each processor instance has a unique ID).
var (
	id      = int(instanceID.Inc())
	log     = logp.NewLogger(logName).With("instance_id", id)
	metrics = monitoring.Default.NewRegistry(logName+"."+strconv.Itoa(id), monitoring.DoNotReport)
)
So maybe we could play the same game here?
Alternatively we could create a new `processors.MonitoredProcessor` struct that implements `processors.Processor` but also implements a `MonitoringRegistry()` method that provides a monitoring registry instance for any processor that wishes to be monitored. This way we would have a consistent namespace for each processor instance within the libbeat monitoring registry for per-processor-instance metrics.
WDYT?
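Sketching that idea, with heavy caveats: no such type exists in libbeat, and every name here (`MonitoredProcessor`, `Monitored`, `MonitoringRegistry`) is a hypothetical placeholder. It assumes the libbeat `monitoring` package and `strconv` are imported, and that it lives next to the existing `Processor` interface in the `processors` package:

```go
// MonitoredProcessor wraps any processors.Processor and gives it a
// per-instance monitoring registry. Purely illustrative.
type MonitoredProcessor struct {
	Processor                      // the wrapped processor
	registry  *monitoring.Registry // per-instance metrics namespace
}

// Monitored assigns the wrapped processor a namespaced registry such as
// "processors.<name>.<id>".
func Monitored(p Processor, name string, id int) *MonitoredProcessor {
	reg := monitoring.Default.NewRegistry(
		"processors."+name+"."+strconv.Itoa(id), monitoring.DoNotReport)
	return &MonitoredProcessor{Processor: p, registry: reg}
}

// MonitoringRegistry returns the registry the wrapped processor can use
// to register its own counters.
func (m *MonitoredProcessor) MonitoringRegistry() *monitoring.Registry {
	return m.registry
}
```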
> Having debug logging about the throttled events could also help with tricky cases and complicated configs.

Agreed. We already have this today:

p.logger.Debugf("event [%v] dropped by rate_limit processor", event)
Ok, if we already have other processors doing something like this, I am ok with following the same approach 👍
Cool, I'm going to change this PR to remove the metric field on the event and instead set up a monitoring counter for the cumulative total number of events rate limited.
	if p.config.MetricField != "" && p.numRateLimited.Load() > 0 {
		event.PutValue(p.config.MetricField, p.numRateLimited.Load())
		p.numRateLimited.Store(0)
If multiple goroutines are sending events at the same time, all of them could see that `p.numRateLimited.Load() > 0`, and they would create multiple events with the same count before it is reset to 0. We could use an atomic swap to ensure that the counter is only set in one event.
Suggested change:
-	if p.config.MetricField != "" && p.numRateLimited.Load() > 0 {
-		event.PutValue(p.config.MetricField, p.numRateLimited.Load())
-		p.numRateLimited.Store(0)
+	if p.config.MetricField != "" {
+		if count := p.numRateLimited.Swap(0); count > 0 {
+			event.PutValue(p.config.MetricField, count)
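To see why the swap closes the race, here is a small self-contained sketch using the standard library's `sync/atomic` (rather than the libbeat atomic wrapper the processor uses):

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

func main() {
	// Pretend 5 events were rate limited before this batch of goroutines runs.
	var numRateLimited uint64 = 5
	var reporters uint64

	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Swap returns the old value and zeroes the counter atomically,
			// so exactly one goroutine observes the non-zero count.
			if count := atomic.SwapUint64(&numRateLimited, 0); count > 0 {
				atomic.AddUint64(&reporters, 1) // this goroutine would tag its event
			}
		}()
	}
	wg.Wait()
	fmt.Println("goroutines that reported the count:", reporters) // always prints 1
}
```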
Force-pushed from 40ec456 to 7d2d900
@jsoriano I've updated this PR significantly per #23330 (comment). Please re-review when you get a chance. Thanks!
Looks good, added some minor comments.
Co-authored-by: Jaime Soriano Pastor <jaime.soriano@elastic.co>
…23330) (#23493)

* [Rate limit processor] Add counter metric for dropped events (#23330)
* Adding throtted_field
* Documenting the field
* Adding note on dropping of events
* Renaming metric field
* Adding CHANGELOG entry
* Converting to monitoring counter metric
* Removing metric_field
* Fixing wrapping
* Removing old entry from CHANGELOG
* Removing extra empty lines from CHANGELOG

Co-authored-by: Jaime Soriano Pastor <jaime.soriano@elastic.co>
What does this PR do?

This PR adds a monitoring counter metric, `processors.rate_limit.n.dropped`, where `n` is the `n`th instance (1-based) of a `rate_limit` processor used in a Beat's configuration. This counter is incremented each time the processor drops an event due to rate limiting.

Why is it important?

This allows users of the processor to understand whether their events are being rate limited and by how much.
Checklist

- I have made corresponding change to the default configuration files
- I have added tests that prove my fix is effective or that my feature works
- I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc

How to test this PR locally
1. Create a minimal Filebeat configuration with this processor in it (a hypothetical example is sketched below).
2. Run Filebeat with the above configuration.
3. Send events to Filebeat via STDIN at a rate faster than one event per minute (the rate limit).
4. In another window, check that the Filebeat Stats API has the monitoring counter implemented by this PR and that it is incrementing as expected.
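For example, a minimal test configuration might look like this sketch (the limit value, console output, and Stats API port are illustrative assumptions):

```yaml
# filebeat.yml - hypothetical test configuration
filebeat.inputs:
  - type: stdin

processors:
  - rate_limit:
      limit: "1/m"   # allow at most one event per minute

output.console:
  enabled: true

# expose the local Stats API
http.enabled: true
http.port: 5066
```

While piping events in faster than one per minute, poll `curl -s http://localhost:5066/stats` and check that a `processors.rate_limit.1.dropped` value is present and growing.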
Related issues