Fix Retry time in Exporter #7004
Comments
For what it's worth, our exponential backoff algorithm's alignment with gRPC's is a coincidence. It's now clear that for the purposes of OTLP, clients need to support "exponential backoff with jitter", but there's no common definition of what the algorithm is or what its parameters are. So in the absence of some standard we're required or encouraged to follow, we need to evaluate changes to the algorithm on the merits of the changes themselves.
What's the expected jitter tolerance? The gRPC proposal indicates that it's ±0.2, but why 0.2 instead of 0.3?
There was a similar discussion around the implementation we are currently using. By definition, with exponential backoff the retries after the initial backoff should wait longer than the previous ones, but since we draw a random value between 0 and the calculated backoff, retries can be exhausted far too quickly in some scenarios. If the 0.2 jitter is debatable, then maybe we can make it configurable.
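To make the difference concrete, here is a minimal sketch (class, method, and parameter names are hypothetical, not taken from the exporter's actual code) contrasting the behavior described above, where the wait is drawn uniformly from 0 up to the calculated backoff, with a ±0.2 jitter around the calculated value as described in the gRPC proposal:

```java
import java.util.concurrent.ThreadLocalRandom;

// Illustrative only: names and default values here are hypothetical,
// not taken from the exporter's actual implementation.
class BackoffJitterSketch {
  static final double INITIAL_BACKOFF_MS = 1_000;
  static final double MULTIPLIER = 1.5;

  // Behavior described above: the wait is drawn uniformly from [0, calculated),
  // so a later attempt can wait less than an earlier one.
  static long fullRangeWait(int attempt) {
    double calculated = INITIAL_BACKOFF_MS * Math.pow(MULTIPLIER, attempt);
    return (long) ThreadLocalRandom.current().nextDouble(0, calculated);
  }

  // Bounded jitter as described in the gRPC proposal: the wait stays within
  // +/- jitter (e.g. 20%) of the calculated value, so successive waits keep growing.
  static long boundedJitterWait(int attempt, double jitter) {
    double calculated = INITIAL_BACKOFF_MS * Math.pow(MULTIPLIER, attempt);
    return (long) ThreadLocalRandom.current()
        .nextDouble(calculated * (1 - jitter), calculated * (1 + jitter));
  }

  public static void main(String[] args) {
    for (int attempt = 0; attempt < 5; attempt++) {
      System.out.printf("attempt %d: full-range=%d ms, bounded=%d ms%n",
          attempt, fullRangeWait(attempt), boundedJitterWait(attempt, 0.2));
    }
  }
}
```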
I think I agree with making the range… And as you mention, if needed, we can always add an additional parameter to make… But what's weird to me is why they still don't put a bound on it corresponding to the maxBackoff. Like, why not… The debate I'm having now is: is this issue with maxBackoff a good enough reason to diverge from gRPC? Whether or not the algorithm is perfect, it's nice to point to some sort of standard as the basis for our algorithm.
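For illustration, the kind of bound being asked about might look something like the following. This is a hypothetical sketch that assumes a multiplicative jitter; it is not what gRPC or the exporter actually does, and the parameter names are illustrative:

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical sketch of the extra bound discussed above: clamp the grown
// backoff to maxBackoff and keep the jittered wait from exceeding it.
// Parameter names are illustrative, not the exporter's configuration options.
class MaxBackoffBoundSketch {
  static long waitMs(int attempt, double initialBackoffMs, double multiplier,
      double maxBackoffMs, double jitter) {
    double calculated =
        Math.min(initialBackoffMs * Math.pow(multiplier, attempt), maxBackoffMs);
    double jittered = ThreadLocalRandom.current()
        .nextDouble(calculated * (1 - jitter), calculated * (1 + jitter));
    // The question raised above: why not also bound the jittered value itself?
    return (long) Math.min(jittered, maxBackoffMs);
  }

  public static void main(String[] args) {
    for (int attempt = 0; attempt < 8; attempt++) {
      System.out.printf("attempt %d: %d ms%n",
          attempt, waitMs(attempt, 1_000, 2.0, 30_000, 0.2));
    }
  }
}
```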
+1
Maybe it's good to have some jitter in the maxBackoff? E.g. in case the retries continue even once they hit max backoff, so you don't end up pinging at exactly every 30 seconds 🤷‍♂️
Describe the bug
After the initial wait, subsequent wait times are randomized within an upper bound, leading to sporadic behavior that deviates from the expected retry pattern.
Related discussion:
#3936 (comment)
Steps to reproduce
What did you expect to see?
The wait times should closely follow the calculated values, with a small and predictable jitter (e.g., 0.2).
What did you see instead?
The wait times are highly variable, appearing random and exceeding the expected jitter tolerance.
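As a purely illustrative example (parameter values assumed, not the exporter's defaults): with an initial backoff of 1s, a multiplier of 2, and a 0.2 jitter, attempt n would be expected to wait roughly 2^(n-1) seconds ±20% (about 1s, 2s, 4s, 8s, ...), whereas with the current full-range randomization each wait is drawn uniformly from 0 up to 2^(n-1) seconds, so a later attempt can easily wait less than an earlier one.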
What version and what artifacts are you using?
All
Environment
Should be All
Additional context
For reference, see this related gRPC proposal:
grpc/proposal#452