Remove the handling of retries by the SDK. #511

jkwatson · 2020-03-10T18:03:10Z

resolves #509

bogdandrutu · 2020-03-10T19:36:51Z

Oberon00 · 2020-03-12T12:48:29Z

Is the Success/Failure enum actually still useful now, or should we just let every language use its usual error handling (e.g. Result, exceptions, ...)?

tigrannajaryan · 2020-03-12T15:14:01Z

I have some doubts about this.

First, I think we all agree that retries have to happen somewhere since temporary failures are a thing and we cannot afford to lose data on every random sending problem.

It appears the thinking is that retries should be the responsibility of each exporter and that the SDK does not care about them. My experience with exporters in the Collector is that everybody has different ideas about how to handle errors, what is a fatal error vs. a transient one, and whether to retry at all. There is no uniformity because exporters are implemented by different people with different mindsets on this topic. The end result is that, depending on which exporter you choose, you can get very different delivery guarantees and behavior.

Having Retryable vs NotRetryable in the API at least forces the person who implements an exporter to think about this topic. Absent this, we will have to rely on guidelines for exporter authors to do the right thing, and I am doubtful it will work well.

This also requires non-trivial code duplication in exporters. Each exporter has to implement the retrying logic, timeouts, backoff, etc. This is an opportunity for bugs, since it is not trivial to implement correctly (my hope was that we could do this once in the SDK, which sees more scrutiny and code review).

I would like to hear the arguments about why it is a good idea to move this functionality to the exporters.

One argument I see is that the retrying logic may be very different depending on the protocol. Perhaps this is a sufficient argument. Some protocols have a built-in notion of retries and how backoff should be handled, others don't.

For protocols like OTLP you may still need to implement a large portion of the retrying logic in the exporter, otherwise you will not be compliant with the spec. This means you will always return either FailedNotRetryable or Success to the SDK, rendering the whole retrying logic in the SDK useless.
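
To make the point above concrete, here is a minimal sketch of an exporter that keeps its retry logic internal and therefore only ever reports two outcomes to the SDK. Everything in it is an assumption made for illustration: `TwoStateResult`, `Transport`, `SelfRetryingExporter` and `maxAttempts` are hypothetical names, not part of any OpenTelemetry API, and a real protocol such as OTLP defines its own backoff rules.

```java
import java.util.List;

/** Hypothetical two-value result, mirroring the Success / FailedNotRetryable pair. */
enum TwoStateResult { SUCCESS, FAILED_NOT_RETRYABLE }

/** Hypothetical transport; returns true when the backend accepted the batch. */
interface Transport {
  boolean send(List<String> batch);
}

/** Sketch of an exporter that retries internally, so the SDK never sees a "retryable" outcome. */
final class SelfRetryingExporter {
  private final Transport transport;
  private final int maxAttempts; // assumption: a fixed attempt budget

  SelfRetryingExporter(Transport transport, int maxAttempts) {
    this.transport = transport;
    this.maxAttempts = maxAttempts;
  }

  TwoStateResult export(List<String> batch) {
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      if (transport.send(batch)) {
        return TwoStateResult.SUCCESS;
      }
      // A real exporter would wait here according to its protocol's backoff rules.
    }
    // The attempt budget is used up; from the SDK's point of view this is final.
    return TwoStateResult.FAILED_NOT_RETRYABLE;
  }
}
```

With an exporter shaped like this, any retry loop in the SDK would never be triggered, which is the redundancy described above.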

For other protocols which do not define the retrying logic in detail it likely would be sufficient to rely on SDK's built-in logic.

I am willing to be convinced that removing this from the SDK is the right approach, but would want to do something to make sure exporters implement it properly (at least produce a guidance document).

jkwatson · 2020-03-12T15:25:51Z

I think you have hit the nail on the head, @tigrannajaryan. Every protocol and backend has its own retry semantics. For New Relic, our SDK for writing metrics and spans already handles retries, based on how our APIs behave. It's fairly complex, and figuring out how to communicate that logic to the SDK for every possible protocol/backend seems like a fool's errand. Yes, New Relic can always just return SUCCESS (our exporters for Java do the work on a background thread), but isn't every protocol/backend going to have its own notion of retries and how they should work?

tigrannajaryan · 2020-03-12T15:58:29Z

> isn't every protocol/backend going to have its own notion of retries and how they should work?

Well, perhaps they should, but some don't. For example, the Jaeger Thrift exporter in the Collector implementation simply returns an error for any HTTP status code >= 400 and lets the caller decide what to do with the error. It does not attempt to retry. By contrast, the Kinesis exporter in the Collector implements retries on failures (not trying to blame anyone here; as a maintainer I take responsibility for both).

My point is that with many contributors and especially because exporters are typically implemented by vendors there is likely going to be a very different take in each case.

I'd rather provide at least an easy way for exporters to report transient errors and have some handling in the SDK than have nothing at all.

I did not look into the SDK codebases of the languages that we have today, so I don't know if this is also a problem for SDK exporters. Perhaps it is just unfortunate historical baggage of the Collector codebase.

tsloughter · 2020-03-12T16:01:16Z

It could be that the SDK provides very basic exponential backoff up to some configurable limit and then drops the data. This would only be used if the exporter returns a "retry" value, which it only returns if it does not implement retry itself.

So exporters are not forced to implement retry if they would be doing the same simple steps as the SDK, but if they do implement retry they must be properly configured to not return a "retry" response to the SDK.

Basically both worlds.
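
A rough sketch of what such an opt-in, SDK-side fallback could look like follows. It is not specified anywhere in this thread; `BackoffResult`, `BackoffExporter`, `SdkRetryHelper`, `initialDelayMillis` and `maxRetries` are hypothetical names chosen for the example, and the doubling delay is just one simple backoff policy.

```java
import java.util.List;

/** Hypothetical result type; RETRY is only returned by exporters that do not retry themselves. */
enum BackoffResult { SUCCESS, FAILURE, RETRY }

/** Hypothetical exporter interface for this sketch. */
interface BackoffExporter {
  BackoffResult export(List<String> batch);
}

/** Sketch of an SDK-side helper: exponential backoff up to a configurable limit, then drop. */
final class SdkRetryHelper {
  private final long initialDelayMillis;
  private final int maxRetries;

  SdkRetryHelper(long initialDelayMillis, int maxRetries) {
    this.initialDelayMillis = initialDelayMillis;
    this.maxRetries = maxRetries;
  }

  /** Returns true if the batch was exported, false if it failed or was dropped. */
  boolean exportWithBackoff(BackoffExporter exporter, List<String> batch) {
    long delay = initialDelayMillis;
    for (int attempt = 0; attempt <= maxRetries; attempt++) {
      BackoffResult result = exporter.export(batch);
      if (result != BackoffResult.RETRY) {
        // The exporter gave a final answer (possibly after retrying on its own).
        return result == BackoffResult.SUCCESS;
      }
      if (attempt < maxRetries) {
        try {
          Thread.sleep(delay);
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt(); // preserve the interrupt and give up
          return false;
        }
        delay *= 2; // double the wait before the next attempt
      }
    }
    return false; // retry budget exhausted: drop the batch
  }
}
```

An exporter that implements its own retries would simply never return `RETRY`, so the loop above runs exactly once for it.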

tigrannajaryan · 2020-03-12T17:23:16Z

@tsloughter yes, that makes sense to me.

jkwatson · 2020-03-12T17:40:25Z

Until we actually specify retry logic for the SDK to implement, I suggest we keep the enumerated return types (but just the 2 like in this PR).

When/if we specify retry logic for the SDK, we can add the FAILED_RETRYABLE back in here, along with documentation on what it means.

tigrannajaryan · 2020-03-13T15:47:14Z

> Until we actually specify retry logic for the SDK to implement, I suggest we keep the enumerated return types (but just the 2 like in this PR).
>
> When/if we specify retry logic for the SDK, we can add the FAILED_RETRYABLE back in here, along with documentation on what it means.

This sounds reasonable to me.

Oberon00 · 2020-03-16T18:26:31Z

I suggest we should have two interfaces: Exporter and RetryableExporter. Exporter returns nothing, RetryableExporter returns the current enumeration error code. Then we could have a RetryingExporter (which is itself a plain Exporter) that buffers and actually retries spans which need to be retried. The SpanProcessors operate on plain Exporters. In pseudo-Java:

interface Exporter { void export(List<SpanData> spans); }
interface RetryableExporter { StatusCode export(List<SpanData> spans); }
class RetryingExporter implements Exporter {
  private final RetryableExporter exporter;
  public RetryingExporter(RetryableExporter exporter) { this.exporter = exporter; }
  @Override public void export(List<SpanData> spans) {
    /* Call exporter.export(spans) and buffer the spans that need to be retried. */
  }
}

We might want a more complex interface for RetryableExporter, such as ExportResult export(List<SpanData> spans, int numberOfRetriedSpansAtStart); with

interface ExportResult {
   List<SpanData> getSpansToBeRetried();
   Optional<Integer> getRetryAfterMilliseconds();
}

If using the simpler interface, we can of course sacrifice a bit of type safety, ditch Exporter, and rename RetryableExporter to Exporter. Then we are exactly in the situation we have currently, minus having a RetryingExporter. But that one can still be added later at any time.
EDIT: If we apply this PR, we can still usefully add a RetryableExporter interface later.
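
For illustration, the simpler variant of this proposal could be fleshed out roughly as below. This is only a sketch under stated assumptions: `SpanData` and `StatusCode` are stand-ins for the real SDK types, the set of status codes is assumed, and the `maxBufferedSpans` cap and `pendingCount()` helper are hypothetical additions that do not appear in the pseudo-Java above.

```java
import java.util.ArrayList;
import java.util.List;

/** Stand-in for the SDK's span data type; it carries only a name here. */
final class SpanData {
  final String name;
  SpanData(String name) { this.name = name; }
}

/** Assumed result codes; the thread does not fix the exact set. */
enum StatusCode { SUCCESS, FAILED_RETRYABLE, FAILED_NOT_RETRYABLE }

interface Exporter { void export(List<SpanData> spans); }

interface RetryableExporter { StatusCode export(List<SpanData> spans); }

/** Buffers spans whose export failed with FAILED_RETRYABLE and resends them on the next call. */
final class RetryingExporter implements Exporter {
  private final RetryableExporter exporter;
  private final int maxBufferedSpans; // assumption: a cap so the buffer cannot grow without bound
  private final List<SpanData> pending = new ArrayList<>();

  RetryingExporter(RetryableExporter exporter, int maxBufferedSpans) {
    this.exporter = exporter;
    this.maxBufferedSpans = maxBufferedSpans;
  }

  @Override
  public synchronized void export(List<SpanData> spans) {
    // Send previously failed spans together with the new ones.
    List<SpanData> batch = new ArrayList<>(pending);
    batch.addAll(spans);
    pending.clear();
    StatusCode code = exporter.export(batch);
    if (code == StatusCode.FAILED_RETRYABLE) {
      // Keep only the newest spans if the cap is exceeded; older ones are dropped.
      int from = Math.max(0, batch.size() - maxBufferedSpans);
      pending.addAll(batch.subList(from, batch.size()));
    }
    // SUCCESS and FAILED_NOT_RETRYABLE both leave the buffer empty.
  }

  /** Number of spans currently waiting to be retried. */
  synchronized int pendingCount() { return pending.size(); }
}
```

Because `RetryingExporter` is itself a plain `Exporter`, a span processor written against `Exporter` would not need to know whether retries happen at all.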

Oberon00 · 2020-03-16T18:29:28Z

specification/sdk-tracing.md

@@ -375,8 +374,7 @@ type ExportResultCode int

 const (
    Success ExportResultCode = iota


I don't think that "iota" annotation is very accessible.

Not sure what your comment means. I think this is idiomatic Go for an enum.

The point I was trying to make is that not everyone who should understand the spec can read idiomatic go. In that case, I'd say "idiom" hits the nail on the head 😉

https://www.merriam-webster.com/dictionary/idiom

(2a) : the language peculiar to a people or to a district, community, or class

I think very few people outside the Go and math worlds know the meaning of the word "iota". I know of people being puzzled by C++'s std::iota too, e.g. see https://stackoverflow.com/questions/9244879/what-does-iota-of-stdiota-stand-for

I'm assuming the "i" is increment and the "a" is assign, but I could not figure out or find the answer.

I think this is fair. If we want to remove/update this, I suggest a follow-on PR, since it's not directly related to my particular change.

This is an examples section, I think it is reasonable to use idiomatic language constructs in examples. I'd argue examples should actually do precisely that: follow the idioms of the particular language. We have a Java example below as well.

iota is a predeclared identifier in Go and the way enums are supposed to be declared.

If we want more language examples so that everyone finds something familiar then let's add more examples. I don't see the point of making the code worse (and non-idiomatic code is worse IMHO) just because a reader may not be familiar with a particular language.

Oh, sorry, I did not see that context in the diff snippet! If this is a Go-specific example, then iota is completely fine and I retract my objection.

jkwatson · 2020-03-16T18:34:45Z

> I suggest we should have two interfaces: Exporter and RetryableExporter.

I think that should be a discussion for another PR. In this one, I'm just trying to get to a point where we aren't saying things that we can't support. If we want to add retries into the default SDK, we should fully specify it, but outside of this particular PR.

Oberon00 · 2020-03-16T18:36:52Z

Agree that we should discuss this elsewhere. I approved your PR 👍

yurishkuro · 2020-03-17T23:47:37Z

specification/sdk-tracing.md

+  For protocol exporters this typically means that the data is sent over
+  the wire and delivered to the destination server.
+* `Failure` - exporting failed. For example, this
+  can happen when the batch contains bad data and cannot be serialized.


It is not clear to me why there should be a return code at all. What is the SDK supposed to do differently upon receiving a Failure code? This spec over-prescribes what the exporter should do, like the sentence above about "data sent over the wire", which may not be how an exporter works at all, e.g. it might be spooling data to a file on disk. Meanwhile, the spec says nothing about what the SDK should do.

I think of this as a matter of convenience for the exporter. The SDK should have an error handling mechanism, so that it's not the exporter's responsibility to report errors when they occur. IOW the SDK is not required to do anything when an error occurs, but it SHOULD have a way for the user to gain knowledge of the failure.

In Go we have this issue filed: open-telemetry/opentelemetry-go#174

There still needs to be an explanation of what the SDK is supposed to do with the return code. If the answer is "nothing", then there is no need for return code.

The return SHOULD be passed to the user's error handler, I think.

Maintain a metric (counter?) of failed/total exports? Log?

I'm happy either way. If, in the future, we want to add back FAILED_RETRYABLE, allowing exporters to be backward compatible would be good, IMO. Otherwise changing from a void return to an actual return value would be a breaking change for exporters.

I think we can say what we can do:

* Record metrics
* Inform user
* Etc.
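
As a sketch of what "record metrics" and "inform user" could mean on the SDK side, the following processor never retries; it only counts exports and forwards failures to a user-supplied handler. All names (`ExportOutcome`, `OutcomeExporter`, `ReportingProcessor`, the `Consumer<String>` error handler) are hypothetical and chosen for this example; the spec text under review does not define them.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Consumer;

/** Hypothetical two-value result, as proposed in this PR. */
enum ExportOutcome { SUCCESS, FAILURE }

/** Hypothetical exporter interface for this sketch. */
interface OutcomeExporter {
  ExportOutcome export(List<String> batch);
}

/** Sketch: the SDK does not retry; it counts exports and tells the user about failures. */
final class ReportingProcessor {
  private final OutcomeExporter exporter;
  private final Consumer<String> errorHandler; // assumption: the user registers a handler
  private final AtomicLong totalExports = new AtomicLong();
  private final AtomicLong failedExports = new AtomicLong();

  ReportingProcessor(OutcomeExporter exporter, Consumer<String> errorHandler) {
    this.exporter = exporter;
    this.errorHandler = errorHandler;
  }

  void flush(List<String> batch) {
    totalExports.incrementAndGet();
    if (exporter.export(batch) == ExportOutcome.FAILURE) {
      failedExports.incrementAndGet();
      errorHandler.accept("export of " + batch.size() + " span(s) failed");
    }
  }

  long getTotalExports() { return totalExports.get(); }

  long getFailedExports() { return failedExports.get(); }
}
```

Keeping a return value here also leaves room to add a third outcome later without changing the method signature for existing exporters.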

@yurishkuro I would like to have your feedback for this before merging.

@yurishkuro bump! Any last concerns? Otherwise we're going to merge this.

jmacd · 2020-03-18T16:01:38Z

specification/sdk-tracing.md

+  For protocol exporters this typically means that the data is sent over
+  the wire and delivered to the destination server.
+* `Failure` - exporting failed. For example, this
+  can happen when the batch contains bad data and cannot be serialized.


I think of this as a matter of convenience for the exporter. The SDK should have an error handling mechanism, so that it's not the exporter's responsibility to report errors when they occur. IOW the SDK is not required to do anything when an error occurs, but it SHOULD have a way for the user to gain knowledge of the failure.

In Go we have this issue filed: open-telemetry/opentelemetry-go#174

bogdandrutu · 2020-04-01T03:04:55Z

@jkwatson please rebase.

jkwatson · 2020-04-01T15:14:44Z

> @jkwatson please rebase.

done

bogdandrutu · 2020-04-08T17:14:52Z

Please rebase

jkwatson · 2020-04-08T17:29:41Z

rebase done

carlosalberto · 2020-04-13T22:01:11Z

We have enough approvals and no pending issues. We can merge after you rebase @jkwatson ;) (sadly I can't do it for you automatically :( ).

jkwatson · 2020-04-13T22:04:45Z

so many rebases!

arminru · 2020-04-14T11:46:37Z

> so many rebases!

@jkwatson You might want to support motion #512. 🙂

jkwatson requested review from arminru, bogdandrutu, c24t, carlosalberto, iredelmeier, jmacd, reyang, SergeyKanzhelev, tedsuo, tigrannajaryan and yurishkuro as code owners March 10, 2020 18:03

tsloughter approved these changes Mar 10, 2020

bogdandrutu approved these changes Mar 10, 2020

arminru approved these changes Mar 12, 2020

Oberon00 approved these changes Mar 16, 2020

yurishkuro reviewed Mar 17, 2020

Oberon00 approved these changes Mar 18, 2020

jmacd approved these changes Mar 18, 2020

pyohannes mentioned this pull request Mar 20, 2020

Add interfaces for exporter, processor and span data open-telemetry/opentelemetry-cpp#55

Merged

jkwatson force-pushed the remove_retries branch from dcf1ea4 to fb273a0 Compare April 1, 2020 15:13

mauriciovasquezbernal mentioned this pull request Apr 3, 2020

Improve export result handling in SDK open-telemetry/opentelemetry-python#538

Closed

jkwatson force-pushed the remove_retries branch from fb273a0 to 975de4b Compare April 7, 2020 15:03

tedsuo approved these changes Apr 7, 2020

carlosalberto approved these changes Apr 7, 2020

jkwatson force-pushed the remove_retries branch from 975de4b to f9fe3bd Compare April 8, 2020 17:19

yurishkuro approved these changes Apr 11, 2020

c24t approved these changes Apr 13, 2020

Remove the handling of retries from exporters.

f09bbb9

jkwatson force-pushed the remove_retries branch from f9fe3bd to f09bbb9 Compare April 13, 2020 22:04

carlosalberto merged commit e0bd417 into open-telemetry:master Apr 13, 2020

jkwatson mentioned this pull request Apr 14, 2020

Remove the RETRYABLE failure mode from the exporter interfaces. open-telemetry/opentelemetry-java#1112

Merged

hectorhdzg mentioned this pull request Apr 16, 2020

Exporter retry logic should not be the responsibility of the SDK open-telemetry/opentelemetry-python#590

Merged

fbogsany mentioned this pull request Jun 2, 2020

Remove handling of retries by the SDK open-telemetry/opentelemetry-ruby#268

Closed

carlosalberto pushed a commit to carlosalberto/opentelemetry-specification that referenced this pull request Oct 31, 2024

Remove the handling of retries from exporters. (open-telemetry#511)

5dfa0ab
