Introduce soft and hard limits for memory limiter #2250

Conversation

@tigrannajaryan (Member) commented Dec 4, 2020

Contributes to #1121

Description

The memory limiter processor previously had only one limit. When that limit was
exceeded, the processor continuously called GC. This caused huge CPU consumption
when the check interval was small, which forced users to configure large check
intervals. That in turn made the response to growing memory usage lethargic, so
the memory limiter was not very effective when memory usage grew rapidly
(e.g. during a big spike or when the backend was down).

I changed the memory limiter's logic to be based on two thresholds: soft and hard.
While memory usage is below the soft threshold the limiter is fully disabled.
Between the soft and hard thresholds the limiter begins dropping incoming data
but does not perform GC. Only when the hard limit is exceeded do we perform GC.
The net result is that actual memory usage is capped at the level set by the
soft limit and fluctuates between the soft and hard limits as garbage is created
and collected. Correspondingly, GC runs much less frequently, only when the hard
limit is reached, and each such GC immediately collects a significant amount of
garbage (bringing memory usage back close to the soft limit), so no further GC
calls are needed for quite some time.
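
A minimal sketch of this check logic in Go, assuming illustrative names (`memLimiter`, `forcingDrop`, `softLimitBytes`, `hardLimitBytes`) rather than the exact identifiers in `processor/memorylimiter/memorylimiter.go`:

```go
package memorylimiter

import (
	"runtime"
	"sync/atomic"
)

type memLimiter struct {
	softLimitBytes uint64
	hardLimitBytes uint64
	forcingDrop    int64 // read atomically on the data path; non-zero means drop incoming data
}

// checkMemLimits runs on every tick of check_interval.
func (ml *memLimiter) checkMemLimits() {
	ms := &runtime.MemStats{}
	runtime.ReadMemStats(ms)

	switch {
	case ms.Alloc < ml.softLimitBytes:
		// Below the soft limit: the limiter is fully disabled.
		atomic.StoreInt64(&ml.forcingDrop, 0)
	case ms.Alloc < ml.hardLimitBytes:
		// Between the soft and hard limits: drop incoming data, but do not force GC.
		atomic.StoreInt64(&ml.forcingDrop, 1)
	default:
		// At or above the hard limit: force a GC so usage falls back toward the soft limit.
		runtime.GC()
		atomic.StoreInt64(&ml.forcingDrop, 1)
	}
}
```

On the data path the consumers check this flag and reject incoming data with an error while it is set.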

I ran some performance tests with the old and new approaches, using a 4000 MiB limit,
100,000 spans per second, and the exporter completely blocked (no backend).

With the old approach, a 100 ms check interval causes about 450% CPU usage
once the memory limit is hit (below the limit, CPU usage is around 50%).
Here is an extract of the performance test output showing the moment when the limit
is hit:

```
2020/12/03 20:20:47 Agent RAM (RES):3296 MiB, CPU:44.4% | Sent:   7022700 items | Received:         0 items (0/sec)
2020/12/03 20:20:50 Agent RAM (RES):3481 MiB, CPU:43.0% | Sent:   7322500 items | Received:         0 items (0/sec)
2020/12/03 20:20:53 Agent RAM (RES):3681 MiB, CPU:41.6% | Sent:   7614100 items | Received:         0 items (0/sec)
2020/12/03 20:20:56 Agent RAM (RES):3703 MiB, CPU:47.7% | Sent:   7863600 items | Received:         0 items (0/sec)
2020/12/03 20:20:59 Agent RAM (RES):3028 MiB, CPU:47.0% | Sent:   8062700 items | Received:         0 items (0/sec)
2020/12/03 20:21:02 Agent RAM (RES):3644 MiB, CPU:246.9% | Sent:   8331600 items | Received:         0 items (0/sec) <-- likely a regular GC, not at limit yet
2020/12/03 20:21:05 Agent RAM (RES):3555 MiB, CPU:72.8% | Sent:   8620500 items | Received:         0 items (0/sec)
2020/12/03 20:21:08 Agent RAM (RES):3717 MiB, CPU:57.5% | Sent:   8895500 items | Received:         0 items (0/sec)
2020/12/03 20:21:11 Agent RAM (RES):3877 MiB, CPU:126.9% | Sent:   9172900 items | Received:         0 items (0/sec) <-- hit limit
2020/12/03 20:21:14 Agent RAM (RES):3900 MiB, CPU:127.6% | Sent:   9461100 items | Received:         0 items (0/sec)
2020/12/03 20:21:17 Agent RAM (RES):3918 MiB, CPU:201.7% | Sent:   9728900 items | Received:         0 items (0/sec)
2020/12/03 20:21:20 Agent RAM (RES):3938 MiB, CPU:326.0% | Sent:   9994700 items | Received:         0 items (0/sec)
2020/12/03 20:21:23 Agent RAM (RES):3951 MiB, CPU:470.8% | Sent:  10253200 items | Received:         0 items (0/sec)
2020/12/03 20:21:26 Agent RAM (RES):3955 MiB, CPU:440.0% | Sent:  10504400 items | Received:         0 items (0/sec)
2020/12/03 20:21:29 Agent RAM (RES):3961 MiB, CPU:451.0% | Sent:  10766200 items | Received:         0 items (0/sec)
2020/12/03 20:21:32 Agent RAM (RES):3965 MiB, CPU:465.8% | Sent:  11008400 items | Received:         0 items (0/sec)
2020/12/03 20:21:35 Agent RAM (RES):3974 MiB, CPU:423.6% | Sent:  11272700 items | Received:         0 items (0/sec)
```

Even a 1 second interval was unusable with the old approach, and we had to
choose a longer interval to avoid performance degradation.

With the new approach, under the exact same conditions and a 100 ms check interval,
CPU usage is 50% below the memory limits and rises to only 68% when the hard limit
is hit. With a 1 second check interval there is no measurable increase in CPU usage
when the limiter engages (unlike the roughly 9x CPU increase with the old approach).

Here is an extract of the performance test output showing the moment when the hard
limit is hit:

```
2020/12/03 20:28:35 Agent RAM (RES):1888 MiB, CPU:48.2% | Sent:   3796400 items | Received:         0 items (0/sec)
2020/12/03 20:28:38 Agent RAM (RES):2029 MiB, CPU:47.1% | Sent:   4088600 items | Received:         0 items (0/sec)
2020/12/03 20:28:41 Agent RAM (RES):2197 MiB, CPU:48.3% | Sent:   4388200 items | Received:         0 items (0/sec)
2020/12/03 20:28:44 Agent RAM (RES):2370 MiB, CPU:45.7% | Sent:   4679900 items | Received:         0 items (0/sec)
2020/12/03 20:28:47 Agent RAM (RES):2558 MiB, CPU:49.0% | Sent:   4972200 items | Received:         0 items (0/sec)
2020/12/03 20:28:50 Agent RAM (RES):2771 MiB, CPU:47.4% | Sent:   5260700 items | Received:         0 items (0/sec)
2020/12/03 20:28:53 Agent RAM (RES):2921 MiB, CPU:133.3% | Sent:   5547500 items | Received:         0 items (0/sec)
2020/12/03 20:28:56 Agent RAM (RES):2922 MiB, CPU:50.1% | Sent:   5846700 items | Received:         0 items (0/sec)
2020/12/03 20:28:59 Agent RAM (RES):2957 MiB, CPU:43.6% | Sent:   6131700 items | Received:         0 items (0/sec)
2020/12/03 20:29:02 Agent RAM (RES):3144 MiB, CPU:50.0% | Sent:   6419400 items | Received:         0 items (0/sec)
2020/12/03 20:29:05 Agent RAM (RES):3328 MiB, CPU:49.0% | Sent:   6719100 items | Received:         0 items (0/sec)
2020/12/03 20:29:08 Agent RAM (RES):3488 MiB, CPU:38.6% | Sent:   7007300 items | Received:         0 items (0/sec)
2020/12/03 20:29:11 Agent RAM (RES):3667 MiB, CPU:42.0% | Sent:   7306700 items | Received:         0 items (0/sec)
2020/12/03 20:29:14 Agent RAM (RES):3813 MiB, CPU:37.4% | Sent:   7577700 items | Received:         0 items (0/sec)
2020/12/03 20:29:17 Agent RAM (RES):3802 MiB, CPU:170.9% | Sent:   7860100 items | Received:         0 items (0/sec) <-- hit hard limit
2020/12/03 20:29:20 Agent RAM (RES):3882 MiB, CPU:68.1% | Sent:   8160000 items | Received:         0 items (0/sec)
2020/12/03 20:29:23 Agent RAM (RES):4007 MiB, CPU:42.3% | Sent:   8447900 items | Received:         0 items (0/sec)
2020/12/03 20:29:26 Agent RAM (RES):4007 MiB, CPU:39.3% | Sent:   8747800 items | Received:         0 items (0/sec)
2020/12/03 20:29:29 Agent RAM (RES):4008 MiB, CPU:34.3% | Sent:   9038400 items | Received:         0 items (0/sec)
2020/12/03 20:29:32 Agent RAM (RES):4009 MiB, CPU:39.9% | Sent:   9317200 items | Received:         0 items (0/sec)
```

This is a dramatically better picture than with the old approach.

With a 1 second interval, the memory limiter's impact on CPU is not measurable with
the new approach, whereas the old approach still showed several times higher CPU
usage when the limit was hit.

This makes small check intervals practical and allows the collector to effectively
suppress incoming surges of data.
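
For reference, here is a hedged sketch of how the two thresholds are derived from the processor settings described in the README: `limit_mib` sets the hard limit and `spike_limit_mib` is subtracted from it to obtain the soft limit. The helper function and the 500 MiB spike value are illustrative; only the 4000 MiB limit matches the test above.

```go
package main

import "fmt"

const mibBytes = 1024 * 1024

// limitsFromConfig derives the soft and hard limits (in bytes) from the
// limit_mib and spike_limit_mib settings. Illustrative helper, not collector API.
func limitsFromConfig(limitMiB, spikeLimitMiB uint64) (softBytes, hardBytes uint64) {
	hardBytes = limitMiB * mibBytes
	softBytes = (limitMiB - spikeLimitMiB) * mibBytes
	return softBytes, hardBytes
}

func main() {
	// 4000 MiB hard limit as in the test above; 500 MiB spike headroom is an assumed value.
	soft, hard := limitsFromConfig(4000, 500)
	fmt.Printf("soft=%d MiB, hard=%d MiB\n", soft/mibBytes, hard/mibBytes)
}
```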

Testing:
Tested manually, see above.

Documentation:
Changed README.

@tigrannajaryan tigrannajaryan requested a review from a team December 4, 2020 01:50
@tigrannajaryan tigrannajaryan force-pushed the feature/tigran/fix-memory-limiter branch from a486760 to 01d2fbe on December 4, 2020 01:52
codecov bot commented Dec 4, 2020

Codecov Report

Merging #2250 (e438161) into master (f08fd5b) will decrease coverage by 0.01%.
The diff coverage is 87.17%.


```
@@            Coverage Diff             @@
##           master    #2250      +/-   ##
==========================================
- Coverage   92.05%   92.03%   -0.02%     
==========================================
  Files         272      272              
  Lines       15322    15334      +12     
==========================================
+ Hits        14105    14113       +8     
- Misses        837      840       +3     
- Partials      380      381       +1
```
| Impacted Files | Coverage Δ |
| --- | --- |
| processor/memorylimiter/memorylimiter.go | 90.09% <87.17%> (-1.96%) ⬇️ |
| internal/data/spanid.go | 100.00% <0.00%> (ø) |

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f08fd5b...e438161. Read the comment docs.

@tigrannajaryan (Member, Author) commented:

@kbrockhoff @jpkrohling it would be great to know your opinion on this.

@jpkrohling (Member) left a comment:

This looks great! I'll take some time later today, or early next week, to try this out under the same conditions as #1121.

(Review comments were left on processor/memorylimiter/README.md, processor/memorylimiter/memorylimiter.go, and testbed/tests/trace_test.go; all were resolved.)
@tigrannajaryan tigrannajaryan force-pushed the feature/tigran/fix-memory-limiter branch from 01d2fbe to 8a2bcb1 on December 9, 2020 15:54
@tigrannajaryan (Member, Author) commented:

Don't merge yet, I am looking into improving the limiting logic a bit more.

@tigrannajaryan tigrannajaryan force-pushed the feature/tigran/fix-memory-limiter branch from 8a2bcb1 to 7d91027 on December 9, 2020 20:35
@tigrannajaryan (Member, Author) commented:

@jpkrohling @bogdandrutu I made some slight changes to the overall logic. We now try a GC when we hit the soft limit, but we limit the frequency of GC while we are between the soft and hard limits. If GC helps and brings usage back below the soft limit, we don't enable dropping. This helps avoid data loss when we just touch the soft limit but there is plenty of garbage to be collected.
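
A self-contained sketch of this refined behavior, again with illustrative names; the minimum interval between forced GCs (`minGCIntervalWhenSoftLimited`) is an assumed value, not necessarily the one used in the PR:

```go
package memorylimiter

import (
	"runtime"
	"sync/atomic"
	"time"
)

// Assumed rate limit between forced GCs while between the soft and hard limits.
const minGCIntervalWhenSoftLimited = 10 * time.Second

type memLimiter struct {
	softLimitBytes uint64
	hardLimitBytes uint64
	lastGCDone     time.Time
	forcingDrop    int64 // non-zero means drop incoming data
}

func (ml *memLimiter) checkMemLimits() {
	ms := &runtime.MemStats{}
	runtime.ReadMemStats(ms)

	if ms.Alloc < ml.softLimitBytes {
		// Below the soft limit: limiter disabled.
		atomic.StoreInt64(&ml.forcingDrop, 0)
		return
	}

	// At or above the soft limit: try a GC, but rate-limit it unless the hard
	// limit has been reached.
	if ms.Alloc >= ml.hardLimitBytes || time.Since(ml.lastGCDone) >= minGCIntervalWhenSoftLimited {
		runtime.GC()
		ml.lastGCDone = time.Now()
		runtime.ReadMemStats(ms)
	}

	// Enable dropping only if GC did not bring usage back below the soft limit.
	if ms.Alloc >= ml.softLimitBytes {
		atomic.StoreInt64(&ml.forcingDrop, 1)
	} else {
		atomic.StoreInt64(&ml.forcingDrop, 0)
	}
}
```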

@jpkrohling (Member) left a comment:

Just a minor comment, but LGTM already. Further concerns can be addressed based on real-world usage in follow-up PRs.

@tigrannajaryan (Member, Author) commented:

@bogdandrutu PTAL.

@tigrannajaryan tigrannajaryan force-pushed the feature/tigran/fix-memory-limiter branch from 7d91027 to 390b707 on December 15, 2020 17:06
@tigrannajaryan tigrannajaryan force-pushed the feature/tigran/fix-memory-limiter branch from 390b707 to e438161 on December 15, 2020 17:32
@tigrannajaryan tigrannajaryan merged commit 1c6d624 into open-telemetry:master Dec 15, 2020
@tigrannajaryan tigrannajaryan deleted the feature/tigran/fix-memory-limiter branch December 15, 2020 17:56