-
Notifications
You must be signed in to change notification settings - Fork 895
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove the handling of retries by the SDK. #511
Remove the handling of retries by the SDK. #511
Conversation
@tigrannajaryan are you ok with this? |
Is the Success/Failure enum now actually still useful, or should we just let every language use its usual error handling (e.g. Result, exceptions, ...) |
I have some doubts about this. First, I think we all agree that retries have to happen somewhere since temporary failures are a thing and we cannot afford to lose data on every random sending problem. It appears the thinking is that the retries should be the responsibility of each exporter and SDK does not care about it. My experience with exporters in the Collector is that everybody has different ideas about how to handle the errors, what is an fatal error vs transient, whether to retry or not at all. There is no uniformity because exporters are implemented by different people with different mindset on this topic. The end result is that depending on what exporter you choose you can very different delivery guarantees and behavior. Having the Retryable vs NotRetryable in the API at least forces the person who implements exporter to think about this topic. Absent this we will have to rely on guidelines for exporter authors to do the right thing and I am doubtful it will work well. This also requires non-trivial code duplication in exporters. Each exporter has to implement the retrying logic, timeouts, backoff, etc. Here we have an opportunity for bugs since it is not trivial to implement this correctly (my hope was we could do this once in the SDK, which see more scrutiny and code reviews). I would like to hear the arguments about why it is a good idea to move this functionality to the exporters. One argument I see is that the retrying logic may very different depending on the protocol. Perhaps this is sufficient enough argument. Some protocols have the built in notion of retries and how backoff should be handled, some other don't. For protocols like OTLP you still may need to implement large portion of retrying logic in the exporter otherwise you will not be compliant with the spec. This means you will always return either FailedNotRetryable or Success to the SDK, rendering the whole retrying logic in the SDK useless. For other protocols which do not define the retrying logic in detail it likely would be sufficient to rely on SDK's built-in logic. I am willing to be convinced that removing this from the SDK is the right approach, but would want to do something to make sure exporters implement it properly (at least produce a guidance document). |
I think you have hit the nail on the head, @tigrannajaryan . Every protocol and backend has their own retry semantics. For New Relic, our SDK for writing metrics and spans already handles retries, based on how our APIs behave. It's fairly complex, and figuring out how to communicate that logic into the SDK for every possible protocol/backend seems like a fools-errand. Yes, New Relic can always just return SUCCESS (our exporters for java do the work on a background thread), but isn't every protocol/backend going to have its own notion of retries and how they should work? |
Well, perhaps they should, but some don't. For example Jaeger Thrift exporter in Collector implementation simply returns an error for any HTTP status code >= 400 and let's the caller think about what to do with the error. It does not attempt to retry. As opposed to that Kinesis exporter in Collector implements retries on failures (not trying to blame anyone here, as a maintainer I take the responsibility for both). My point is that with many contributors and especially because exporters are typically implemented by vendors there is likely going to be a very different take in each case. I'd at least provide an easy way for exporters to report transient errors and have some handling in the SDK than have nothing at all. I did not look into the SDK codebases of languages that we have today so I don't know if this is also a problem for SDK exporters. Perhaps it is just an unfortunate historical baggage of Collector codebase. |
It could be that the SDK provides very basic expotential backoff up to some configurable limit and then drops them. And that this is only used if the exporter returns a "retry" value which is only returned if it does not implement retry itself. So exporters are not forced to implement retry if they would be doing the same simple steps as the SDK, but if they do implement retry they must be properly configured to not return a "retry" response to the SDK. Basically both worlds. |
@tsloughter yes, that makes sense to me. |
Until we actually specify retry logic for the SDK to implement, I suggest we keep the enumerated return types (but just the 2 like in this PR). When/if we specify retry logic for the SDK, we can add the FAILED_RETRYABLE back in here, along with documentation on what it means. |
This sounds reasonable to me. |
I suggest we should have two interfaces: Exporter and RetryableExporter. Exporter returns nothing, RetryableExporter returns the current enumeration error code. Then we could have a RetryingExporter (which is itself a plain Exporter) that buffers and actually retries spans which need to be retried. The SpanProcessors operate on plain Exporters. In pseudo-Java: interface Exporter { void export(List<SpanData> spans); }
interface RetryableExporter { StatusCode export(List<SpanData> spans); }
class RetryingExporter implements Exporter {
public RetryingExporter(RetryableExporter exporter) { this.exporter = exporter; }
@Override public void export(List<SpanData> spans) {
/* Call exporter.export(), buffer failed spans. */
}
} We might want a more complex interval for RetryableExporter, such as class ExportResult {
public List<SpanData>getSpansToBeRetried():
public Optional<Integer> getRetryAfterMilliseconds();
} If using the simpler interface, of course we can sacrifice a bit of type safety and just ditch Exporter, and rename RetryableExporter to Exporter. Then we are exactly at the situation we have currently, minus having an RetryingExporter. But that one can still be added later at any time. |
specification/sdk-tracing.md
Outdated
@@ -375,8 +374,7 @@ type ExportResultCode int | |||
|
|||
const ( | |||
Success ExportResultCode = iota |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think that "iota" annotation is very accessible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure what your comment means. I think this is idiomatic go for an enum.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The point I was trying to make is that not everyone who should understand the spec can read idiomatic go. In that case, I'd say "idiom" hits the nail on the head 😉
https://www.merriam-webster.com/dictionary/idiom
(2a) : the language peculiar to a people or to a district, community, or class
I think very few outside the go & math world know about the meaning of the word "iota". I know of people being puzzled by C++'s std::iota
too, e.g. see https://stackoverflow.com/questions/9244879/what-does-iota-of-stdiota-stand-for
I'm assuming the "i" is increment and the "a" is assign, but I could not figure out or find the answer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is fair. If we want to remove/update this, I suggest a follow-on PR, since it's not directly related to my particular change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an examples section, I think it is reasonable to use idiomatic language constructs in examples. I'd argue examples should actually do precisely that: follow the idioms of the particular language. We have a Java example below as well.
iota
is a language keyword in Go and the way enums are supposed to be declared.
If we want more language examples so that everyone finds something familiar then let's add more examples. I don't see the point of making the code worse (and non-idiomatic code is worse IMHO) just because a reader may not be familiar with a particular language.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, sorry, I did not see that context in the diff snippet! If this is a Go-specific example, then iota is completely fine and I retract my objection.
I think that should be a discussion for another PR. In this one, I'm just trying to get to a point where we aren't saying things that we can't support. If we want to add retries into the default SDK, we should fully specify it, but outside of this particular PR. |
Agree that we should discuss this elsewhere. I approved your PR 👍 |
specification/sdk-tracing.md
Outdated
For protocol exporters this typically means that the data is sent over | ||
the wire and delivered to the destination server. | ||
* `Failure` - exporting failed. For example, this | ||
can happen when the batch contains bad data and cannot be serialized. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is not clear to me why there should be a return code at all. What is the SDK supposed to do differently upon receiving a Failure
code? This spec over-prescribes what the exporter should do, like the sentence above about "data sent over the wire", which may not be how exporter works at all, e.g. it might be spooling data to a file on disk. Meanwhile, the spec says nothing about what the SDK should do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think of this as a matter of convenience for the exporter. The SDK should have an error handling mechanism, so that it's not the exporters responsibility to report errors when they occur. IOW the SDK is not required to do anything when an error occurs, but it SHOULD have a way for the user to gain knowledge of the failure.
In Go we have this issue filed: open-telemetry/opentelemetry-go#174
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There still needs to be an explanation of what the SDK is supposed to do with the return code. If the answer is "nothing", then there is no need for return code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The return SHOULD be passed to the user's error handler, I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maintain a metric (counter?) of failed/total exports? Log?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm happy either way. If, in the future, we want to add back FAILED_RETRYABLE, allowing exporters to be backward compatible would be good, IMO. Otherwise changing from a void return to an actual return value would be a breaking change for exporters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can say what we can do:
- Record metrics
- Inform user
- Etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yurishkuro I would like to have your feedback for this before merging.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yurishkuro bump! Any last concerns? Otherwise we're going to merge this.
specification/sdk-tracing.md
Outdated
For protocol exporters this typically means that the data is sent over | ||
the wire and delivered to the destination server. | ||
* `Failure` - exporting failed. For example, this | ||
can happen when the batch contains bad data and cannot be serialized. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think of this as a matter of convenience for the exporter. The SDK should have an error handling mechanism, so that it's not the exporters responsibility to report errors when they occur. IOW the SDK is not required to do anything when an error occurs, but it SHOULD have a way for the user to gain knowledge of the failure.
In Go we have this issue filed: open-telemetry/opentelemetry-go#174
@jkwatson please rebase. |
done |
Please rebase |
rebase done |
We have enough approvals and no pending issues. We can merge after you rebase @jkwatson ;) (sadly I can't do it for you automatically :( ). |
f9fe3bd
to
f09bbb9
Compare
so many rebases! |
resolves #509