Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove the handling of retries by the SDK. #511

Merged
merged 1 commit into from
Apr 13, 2020

Conversation

jkwatson
Copy link
Contributor

resolves #509

@bogdandrutu
Copy link
Member

@tigrannajaryan are you ok with this?

@Oberon00
Copy link
Member

Is the Success/Failure enum now actually still useful, or should we just let every language use its usual error handling (e.g. Result, exceptions, ...)

@tigrannajaryan
Copy link
Member

I have some doubts about this.

First, I think we all agree that retries have to happen somewhere since temporary failures are a thing and we cannot afford to lose data on every random sending problem.

It appears the thinking is that the retries should be the responsibility of each exporter and SDK does not care about it. My experience with exporters in the Collector is that everybody has different ideas about how to handle the errors, what is an fatal error vs transient, whether to retry or not at all. There is no uniformity because exporters are implemented by different people with different mindset on this topic. The end result is that depending on what exporter you choose you can very different delivery guarantees and behavior.

Having the Retryable vs NotRetryable in the API at least forces the person who implements exporter to think about this topic. Absent this we will have to rely on guidelines for exporter authors to do the right thing and I am doubtful it will work well.

This also requires non-trivial code duplication in exporters. Each exporter has to implement the retrying logic, timeouts, backoff, etc. Here we have an opportunity for bugs since it is not trivial to implement this correctly (my hope was we could do this once in the SDK, which see more scrutiny and code reviews).

I would like to hear the arguments about why it is a good idea to move this functionality to the exporters.

One argument I see is that the retrying logic may very different depending on the protocol. Perhaps this is sufficient enough argument. Some protocols have the built in notion of retries and how backoff should be handled, some other don't.

For protocols like OTLP you still may need to implement large portion of retrying logic in the exporter otherwise you will not be compliant with the spec. This means you will always return either FailedNotRetryable or Success to the SDK, rendering the whole retrying logic in the SDK useless.

For other protocols which do not define the retrying logic in detail it likely would be sufficient to rely on SDK's built-in logic.

I am willing to be convinced that removing this from the SDK is the right approach, but would want to do something to make sure exporters implement it properly (at least produce a guidance document).

@jkwatson
Copy link
Contributor Author

I think you have hit the nail on the head, @tigrannajaryan . Every protocol and backend has their own retry semantics. For New Relic, our SDK for writing metrics and spans already handles retries, based on how our APIs behave. It's fairly complex, and figuring out how to communicate that logic into the SDK for every possible protocol/backend seems like a fools-errand. Yes, New Relic can always just return SUCCESS (our exporters for java do the work on a background thread), but isn't every protocol/backend going to have its own notion of retries and how they should work?

@tigrannajaryan
Copy link
Member

isn't every protocol/backend going to have its own notion of retries and how they should work?

Well, perhaps they should, but some don't. For example Jaeger Thrift exporter in Collector implementation simply returns an error for any HTTP status code >= 400 and let's the caller think about what to do with the error. It does not attempt to retry. As opposed to that Kinesis exporter in Collector implements retries on failures (not trying to blame anyone here, as a maintainer I take the responsibility for both).

My point is that with many contributors and especially because exporters are typically implemented by vendors there is likely going to be a very different take in each case.

I'd at least provide an easy way for exporters to report transient errors and have some handling in the SDK than have nothing at all.

I did not look into the SDK codebases of languages that we have today so I don't know if this is also a problem for SDK exporters. Perhaps it is just an unfortunate historical baggage of Collector codebase.

@tsloughter
Copy link
Member

It could be that the SDK provides very basic expotential backoff up to some configurable limit and then drops them. And that this is only used if the exporter returns a "retry" value which is only returned if it does not implement retry itself.

So exporters are not forced to implement retry if they would be doing the same simple steps as the SDK, but if they do implement retry they must be properly configured to not return a "retry" response to the SDK.

Basically both worlds.

@tigrannajaryan
Copy link
Member

@tsloughter yes, that makes sense to me.

@jkwatson
Copy link
Contributor Author

Until we actually specify retry logic for the SDK to implement, I suggest we keep the enumerated return types (but just the 2 like in this PR).

When/if we specify retry logic for the SDK, we can add the FAILED_RETRYABLE back in here, along with documentation on what it means.

@tigrannajaryan
Copy link
Member

Until we actually specify retry logic for the SDK to implement, I suggest we keep the enumerated return types (but just the 2 like in this PR).

When/if we specify retry logic for the SDK, we can add the FAILED_RETRYABLE back in here, along with documentation on what it means.

This sounds reasonable to me.

@Oberon00
Copy link
Member

Oberon00 commented Mar 16, 2020

I suggest we should have two interfaces: Exporter and RetryableExporter. Exporter returns nothing, RetryableExporter returns the current enumeration error code. Then we could have a RetryingExporter (which is itself a plain Exporter) that buffers and actually retries spans which need to be retried. The SpanProcessors operate on plain Exporters. In pseudo-Java:

interface Exporter { void export(List<SpanData> spans); }
interface RetryableExporter { StatusCode export(List<SpanData> spans); }
class RetryingExporter implements Exporter {
  public RetryingExporter(RetryableExporter exporter) { this.exporter = exporter; }
  @Override public void export(List<SpanData> spans) {
    /* Call exporter.export(), buffer failed spans. */
  }
}

We might want a more complex interval for RetryableExporter, such as ExportResult export(List<SpanData> spans, int numberOfRetriedSpansAtStart); with

class ExportResult {
   public List<SpanData>getSpansToBeRetried():
   public Optional<Integer> getRetryAfterMilliseconds();
}

If using the simpler interface, of course we can sacrifice a bit of type safety and just ditch Exporter, and rename RetryableExporter to Exporter. Then we are exactly at the situation we have currently, minus having an RetryingExporter. But that one can still be added later at any time.
EDIT: If we apply this PR, we can still usefully add a RetryableExporter interface later.

specification/sdk-tracing.md Outdated Show resolved Hide resolved
@@ -375,8 +374,7 @@ type ExportResultCode int

const (
Success ExportResultCode = iota
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that "iota" annotation is very accessible.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure what your comment means. I think this is idiomatic go for an enum.

Copy link
Member

@Oberon00 Oberon00 Mar 17, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The point I was trying to make is that not everyone who should understand the spec can read idiomatic go. In that case, I'd say "idiom" hits the nail on the head 😉

https://www.merriam-webster.com/dictionary/idiom

(2a) : the language peculiar to a people or to a district, community, or class

I think very few outside the go & math world know about the meaning of the word "iota". I know of people being puzzled by C++'s std::iota too, e.g. see https://stackoverflow.com/questions/9244879/what-does-iota-of-stdiota-stand-for

I'm assuming the "i" is increment and the "a" is assign, but I could not figure out or find the answer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is fair. If we want to remove/update this, I suggest a follow-on PR, since it's not directly related to my particular change.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an examples section, I think it is reasonable to use idiomatic language constructs in examples. I'd argue examples should actually do precisely that: follow the idioms of the particular language. We have a Java example below as well.

iota is a language keyword in Go and the way enums are supposed to be declared.

If we want more language examples so that everyone finds something familiar then let's add more examples. I don't see the point of making the code worse (and non-idiomatic code is worse IMHO) just because a reader may not be familiar with a particular language.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, sorry, I did not see that context in the diff snippet! If this is a Go-specific example, then iota is completely fine and I retract my objection.

@jkwatson
Copy link
Contributor Author

I suggest we should have two interfaces: Exporter and RetryableExporter.

I think that should be a discussion for another PR. In this one, I'm just trying to get to a point where we aren't saying things that we can't support. If we want to add retries into the default SDK, we should fully specify it, but outside of this particular PR.

@Oberon00
Copy link
Member

Agree that we should discuss this elsewhere. I approved your PR 👍

For protocol exporters this typically means that the data is sent over
the wire and delivered to the destination server.
* `Failure` - exporting failed. For example, this
can happen when the batch contains bad data and cannot be serialized.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not clear to me why there should be a return code at all. What is the SDK supposed to do differently upon receiving a Failure code? This spec over-prescribes what the exporter should do, like the sentence above about "data sent over the wire", which may not be how exporter works at all, e.g. it might be spooling data to a file on disk. Meanwhile, the spec says nothing about what the SDK should do.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think of this as a matter of convenience for the exporter. The SDK should have an error handling mechanism, so that it's not the exporters responsibility to report errors when they occur. IOW the SDK is not required to do anything when an error occurs, but it SHOULD have a way for the user to gain knowledge of the failure.

In Go we have this issue filed: open-telemetry/opentelemetry-go#174

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There still needs to be an explanation of what the SDK is supposed to do with the return code. If the answer is "nothing", then there is no need for return code.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The return SHOULD be passed to the user's error handler, I think.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maintain a metric (counter?) of failed/total exports? Log?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm happy either way. If, in the future, we want to add back FAILED_RETRYABLE, allowing exporters to be backward compatible would be good, IMO. Otherwise changing from a void return to an actual return value would be a breaking change for exporters.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can say what we can do:

  1. Record metrics
  2. Inform user
  3. Etc.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yurishkuro I would like to have your feedback for this before merging.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yurishkuro bump! Any last concerns? Otherwise we're going to merge this.

For protocol exporters this typically means that the data is sent over
the wire and delivered to the destination server.
* `Failure` - exporting failed. For example, this
can happen when the batch contains bad data and cannot be serialized.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think of this as a matter of convenience for the exporter. The SDK should have an error handling mechanism, so that it's not the exporters responsibility to report errors when they occur. IOW the SDK is not required to do anything when an error occurs, but it SHOULD have a way for the user to gain knowledge of the failure.

In Go we have this issue filed: open-telemetry/opentelemetry-go#174

@bogdandrutu
Copy link
Member

@jkwatson please rebase.

@jkwatson
Copy link
Contributor Author

jkwatson commented Apr 1, 2020

@jkwatson please rebase.

done

@bogdandrutu
Copy link
Member

Please rebase

@jkwatson
Copy link
Contributor Author

jkwatson commented Apr 8, 2020

rebase done

@carlosalberto
Copy link
Contributor

carlosalberto commented Apr 13, 2020

We have enough approvals and no pending issues. We can merge after you rebase @jkwatson ;) (sadly I can't do it for you automatically :( ).

@jkwatson
Copy link
Contributor Author

so many rebases!

@carlosalberto carlosalberto merged commit e0bd417 into open-telemetry:master Apr 13, 2020
@arminru
Copy link
Member

arminru commented Apr 14, 2020

so many rebases!

@jkwatson You might want to support motion #512. 🙂

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Exporter retry logic should not be the responsibility of the SDK