Enable metric buffering #586

Merged: 2 commits merged into main from nicholas.hulston/add-custom-metric-buffering on Oct 29, 2024

Conversation

@nhulston (Contributor) commented on Oct 28, 2024

What does this PR do?

This PR adds buffering to our StatsD client to support higher volumes of custom metrics. This is accomplished by setting maxBufferSize to a nonzero value (it defaults to 0, which means no buffering at all).

When there is no buffer, we make a request (over localhost) to the Extension every time sendDistributionMetric() is called. When there is a buffer, we keep adding custom metrics to it until it is flushed, which happens when (1) the buffer size is reached, (2) bufferFlushInterval milliseconds have elapsed (currently defaults to 1000 ms), or (3) the Lambda function finishes executing.

The value of maxBufferSize is the size of the payload string, not the number of metrics in the buffer. See https://github.com/brightcove/hot-shots for more documentation.
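
For reference, here is a minimal sketch of how these hot-shots options fit together, assuming the Extension's DogStatsD listener on localhost (the host, port, and metric name are illustrative; this is not the exact code in this PR):

```ts
import { StatsD } from "hot-shots";

// Buffered client: payloads accumulate until maxBufferSize characters are
// reached or bufferFlushInterval milliseconds elapse, whichever comes first.
const client = new StatsD({
  host: "127.0.0.1",         // the Extension listens on localhost
  port: 8125,                // assumed DogStatsD port
  maxBufferSize: 8192,       // size of the buffered payload string
  bufferFlushInterval: 1000, // flush at least once per second
});

// Each call appends to the buffer instead of sending a request immediately.
client.distribution("my.custom.metric", 1, ["env:dev"]);

// On invocation end, close() flushes whatever is still buffered.
client.close((err) => {
  if (err) console.error("Failed to flush metrics buffer", err);
});
```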

See the Math section at the bottom for how I calculated the buffer size. An 8 KB buffer in memory is negligible in Lambda, considering the smallest memory configuration is 128 MB.

Motivation

We have some customers who have opened support tickets complaining about metrics being dropped when sending high volumes (~300) of metrics. This is not an issue in other runtimes like Python, since the Node library is the only one that uses hot-shots.

Testing Guidelines

With maxBufferSize=8192, we have no problem delivering all ~8000 metrics, but with a smaller buffer size or a larger number of metrics, we do see some getting dropped. This makes sense and aligns with the math (see below).
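
For context, a load test along these lines can be reproduced with a handler like the sketch below (handler shape, metric name, and tag are illustrative, not taken from the PR):

```ts
import { datadog, sendDistributionMetric } from "datadog-lambda-js";

const METRIC_COUNT = 8000; // roughly the volume exercised above

export const handler = datadog(async () => {
  // Fire a large burst of custom metrics in a single invocation; with
  // buffering enabled these are batched rather than sent one request each.
  for (let i = 0; i < METRIC_COUNT; i++) {
    sendDistributionMetric("buffer.test.metric", 1, `iteration:${i}`);
  }
  return { statusCode: 200 };
});
```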

Additional Notes - Areas for reviewers to focus on

  1. Since this issue affects very few users, maybe we could have an environment variable to enable buffering instead of being on by default?
  2. Are there other implications of these changes that I'm missing? I want to make sure this won't introduce concerns like extra overhead or higher cold start times. However, I don't think such a small buffer would have much of an impact, and we're actually making fewer requests over localhost, which should slightly reduce execution time.

Types of Changes

  • Bug fix
  • New feature
  • Breaking change
  • Misc (docs, refactoring, dependency upgrade, etc.)

Check all that apply

  • This PR's description is comprehensive
  • This PR contains breaking changes that are documented in the description
  • This PR introduces new APIs or parameters that are documented and unlikely to change in the foreseeable future
  • This PR impacts documentation, and it has been updated (or a ticket has been logged)
  • This PR's changes are covered by the automated tests
  • This PR collects user input/sensitive content into Datadog
  • This PR passes the integration tests (ask a Datadog member to run the tests)

Math

Some math to estimate an appropriate buffer size, arriving at maxBufferSize = 8192:

  • From my testing, we hit rate limiting after a varying number of requests, but on the low end at ~200 requests
  • Our formula for the buffer size should be maxBufferSize = N/200 * c, where N is the max number of sendDistributionMetric() calls we want to support being made at once, and c is the number of characters per metric (name + value + date + tags + other characters)
  • My tests had c > 100. For a liberal c = 200 (since customers could have a lot of tags) and N = 1000, we get maxBufferSize ≈ 1024
  • Therefore, we should be able to comfortably support 1000 custom metrics sent at once for maxBufferSize = 1024.
  • However, with c = 200, this is ~5 metrics per request, which defeats the purpose of having a buffer. Therefore, something like maxBufferSize = 8192 should support ~40 metrics per request (see the quick check below).
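
As a quick sanity check on the per-request figures above (assuming c = 200 characters per serialized metric):

```ts
// Rough metrics-per-flush estimate: how many serialized metrics of ~c
// characters fit into one buffered payload of maxBufferSize characters.
const c = 200;

for (const maxBufferSize of [1024, 8192]) {
  const metricsPerRequest = Math.floor(maxBufferSize / c);
  console.log(`maxBufferSize=${maxBufferSize} -> ~${metricsPerRequest} metrics per request`);
}
// maxBufferSize=1024 -> ~5 metrics per request
// maxBufferSize=8192 -> ~40 metrics per request
```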

@nhulston nhulston marked this pull request as ready for review October 28, 2024 21:40
@nhulston nhulston requested a review from a team as a code owner October 28, 2024 21:40
@nhulston nhulston changed the title Add custom metric buffering Enable metric buffering Oct 28, 2024
@joeyzhao2018 (Contributor) commented:

> maybe we could have an environment variable to enable buffering instead of being on by default?

IMHO, given that not using the buffer actually caused the metric-dropping issue, it's safer to default to enabling the buffer. Then we can make it a numeric ENV if the customer wants to set the buffer size to a specific number, be it higher or lower.
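
For illustration, the numeric-env-var idea could look something like the sketch below (the variable name DD_METRIC_BUFFER_SIZE is hypothetical, not an existing setting):

```ts
const DEFAULT_MAX_BUFFER_SIZE = 8192;

// Resolve the buffer size from a (hypothetical) environment variable,
// falling back to the buffered default; 0 would disable buffering entirely.
function resolveMaxBufferSize(): number {
  const raw = process.env.DD_METRIC_BUFFER_SIZE;
  const parsed = raw !== undefined ? Number(raw) : NaN;
  return Number.isFinite(parsed) && parsed >= 0 ? parsed : DEFAULT_MAX_BUFFER_SIZE;
}
```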

@joeyzhao2018 (Contributor) commented:

LGTM.
I don't have any concerns regarding the other questions. However, I think it's worth testing the following two cases:

  1. Flushing when execution finishes: a Lambda that runs for less than the bufferFlushInterval, with total metrics size < buffer size
  2. Flushing at the bufferFlushInterval: a Lambda that runs for longer than the bufferFlushInterval, with total metrics size < buffer size.

Please just confirm those two cases have been tested before merging. Thanks.

@nhulston (Contributor, Author) commented:

> LGTM. I don't have any concerns regarding the other questions. However, I think it's worth testing the following two cases:
>
>   1. Flushing when execution finishes: a Lambda that runs for less than the bufferFlushInterval, with total metrics size < buffer size
>   2. Flushing at the bufferFlushInterval: a Lambda that runs for longer than the bufferFlushInterval, with total metrics size < buffer size.
>
> Please just confirm those two cases have been tested before merging. Thanks.

Both cases work perfectly!
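
For reference, the two cases above can be exercised with handlers roughly like these sketches (handler and metric names are illustrative; bufferFlushInterval defaults to 1000 ms):

```ts
import { datadog, sendDistributionMetric } from "datadog-lambda-js";

// Case 1: the invocation ends before bufferFlushInterval elapses, so the
// buffer must be flushed when execution completes.
export const shortHandler = datadog(async () => {
  sendDistributionMetric("buffer.test.short", 1);
  return "done";
});

// Case 2: the invocation outlives bufferFlushInterval, so the buffer should
// flush mid-invocation on the interval timer.
export const longHandler = datadog(async () => {
  sendDistributionMetric("buffer.test.long", 1);
  await new Promise((resolve) => setTimeout(resolve, 3000)); // > 1000 ms
  return "done";
});
```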

@nhulston nhulston merged commit c7a87e9 into main Oct 29, 2024
25 checks passed
@nhulston nhulston deleted the nicholas.hulston/add-custom-metric-buffering branch October 29, 2024 14:19