Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure reliable data delivery in erroneous situations #7460

Open
tigrannajaryan opened this issue Mar 31, 2023 · 3 comments
Open

Ensure reliable data delivery in erroneous situations #7460

tigrannajaryan opened this issue Mar 31, 2023 · 3 comments

Comments

@tigrannajaryan
Copy link
Member

tigrannajaryan commented Mar 31, 2023

I have been looking at what the Collector does in erroneous situations, such as when the exporter's destination is temporarily unavailable and found a number of places in our codebase where we are probably not handling the error cases correctly, which may result in data losses when such data losses are preventable.

This is a summary task where I want to capture a list of things I think can be improved.

Receivers

Some (most?) receivers currently do not follow the Error Handling contract which says:

// If the error is non-Permanent then the nextConsumer.Consume*() call should be retried
// with the same data.

Here are open issues for some receivers:

If you discover more such receivers please add to the list above.

I would also like to add test helpers that allows us to easily verify the compliance of receivers to the contract. Here is a task to add such helper:

Another part of receiver contracts acknowledgement and checkpointing handling. This issue needs to be resolved to ensure that part of the contract is fulfilled:

Exporters

QueuedRetry exporter helper does not have requeuingEnabled=true when in-memory queue is used. This results in data being dropped after a few retries. We don't want this. We want to keep retrying, just like we do when persistent queue is enabled. Perhaps we make requeuingEnabled as user-visible option if we don't want to hard-code the requeuing behavior.

Here is the list of tasks to resolve:

Batching

batchprocessor currently drops the accumulated batch if the next consumer returns non-permanent error. This is not a correct behavior.

We have 2 possible solutions:

  • Quick fix: make sure batchprocessor stops accepting new data and blocks the input when it is full and retries sending the accumulated batch to the next consumer if previous attempts return non-permanent errors.
  • Longer fix: eliminate batch processor and move batching functionality to exporter helper, see proposal here:

End-to-end Testing

There are many places in our codebase where incorrect implementations and bugs may cause loss of data, especially in stressed and erroneous situations. To ensure the Collector works correctly we want to add end-to-end integration tests that verify the operation of the Collector as a whole:

  • In typical recommended configuration (memorylimiter and batch processors)
  • In memory limited mode and when destination is unavailable.

Without having such test it is hard to be confident that we do not drop the data somewhere due to a bug or due to a component being incorrectly implemented. This should include testing when memory limit is hit.

Here is a task to implement this:

Clarify the Design

Here is the list of issues to resolve:

@tigrannajaryan
Copy link
Member Author

FYI @open-telemetry/collector-approvers

@bgrouxupgrade
Copy link

Re: Batch Exporter. Which path was decided on? If it's the quick fix is there a ticket I can follow?

@mx-psi
Copy link
Member

mx-psi commented Sep 4, 2024

I believe the only issue remaining that would be required for 1.0 is #4335. I added that one to the Collector v1 project and I am going to remove this meta-issue from the Collector v1 project.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants