Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix backoffSleep() so it jitters the sleep time. #505

Closed
wants to merge 2 commits into from

Conversation

adam-mateen
Copy link
Contributor

Description of the issue

When a burst of metrics is forwards to the output plugin and the number of metrics exceeds the internal buffer size, CWA will push the metrics immediately. If there are many hosts running CWA and they all get a burst of metrics at the beginning of each minute, then they will all try uploading at the beginning of the minute. The burst of uploads is likely to surpass API rate limits, and will need to be retried.

Currently when PutMetricData API calls fail, they are retried after deterministic delays instead of jittered.

Description of changes

Fix backoffSleep() so it jitters the sleep time.
Fix publish() so it does not use a negative sleep time.

License

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Tests

  • Updated unit tests.
  • Manually tested a burst of 10K, 20K, 30K statsd metrics.

Requirements

Before commit the code, please do the following steps.

  1. Run make fmt and make fmt-sh
  2. Run make linter

(skipped formatting to keep PR small)

Fix publish() so it does not use a negative sleep time.
@adam-mateen adam-mateen requested a review from a team as a code owner July 7, 2022 21:32
@adam-mateen adam-mateen marked this pull request as draft July 7, 2022 21:33
// Max wait time is 1 minute (jittered)
d := 1 * time.Minute
if n < 5 {
d = base * time.Duration(1<<int64(n))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No change in behavior.
Just a formatting change.

@codecov-commenter
Copy link

codecov-commenter commented Jul 7, 2022

Codecov Report

Merging #505 (78864f7) into master (1f9077f) will increase coverage by 0.03%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #505      +/-   ##
==========================================
+ Coverage   56.71%   56.75%   +0.03%     
==========================================
  Files         374      374              
  Lines       17733    17736       +3     
==========================================
+ Hits        10058    10066       +8     
+ Misses       7083     7079       -4     
+ Partials      592      591       -1     
Impacted Files Coverage Δ
plugins/outputs/cloudwatch/cloudwatch.go 74.37% <100.00%> (+0.29%) ⬆️
plugins/outputs/cloudwatch/util.go 85.96% <100.00%> (-0.48%) ⬇️
plugins/outputs/cloudwatchlogs/pusher.go 89.28% <100.00%> (+0.97%) ⬆️
plugins/inputs/demo/demo.go 57.14% <0.00%> (+7.14%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1f9077f...78864f7. Read the comment docs.

log.Printf("Got publisherJitter %v", publishJitter)
assert.GreaterOrEqual(t, publishJitter, time.Duration(0))
assert.Less(t, publishJitter, time.Minute)
assert.NotEqual(t, publishJitter, last)
Copy link
Contributor Author

@adam-mateen adam-mateen Jul 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is effectively no chance that 2 random numbers can be the same considering the range is 0 to 60,000,000,000.
If someone is opposed I can remove the NotEqual check.

assert.True(t, svc.AssertNumberOfCalls(t, "PutMetricData", 5))

cloudWatchOutput.Close()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No real change in TestWriteError().
The type of backoffRetryBase was changed elsewhere.
The extra call to cloudWatchOutput.Close() is just for cleanup.

@adam-mateen adam-mateen marked this pull request as ready for review July 7, 2022 22:52
@adam-mateen adam-mateen closed this Jul 7, 2022
@adam-mateen
Copy link
Contributor Author

Closed this PR since it is just a subset of this larger PR:
#501
Which will be merged shortly.

@adam-mateen adam-mateen deleted the fix_jitter_on_retries branch July 8, 2022 14:49
dchappa pushed a commit that referenced this pull request Sep 3, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants