-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ensure reliable data delivery in erroneous situations #7460
Comments
FYI @open-telemetry/collector-approvers |
2 tasks
7 tasks
Re: Batch Exporter. Which path was decided on? If it's the quick fix is there a ticket I can follow? |
I believe the only issue remaining that would be required for 1.0 is #4335. I added that one to the Collector v1 project and I am going to remove this meta-issue from the Collector v1 project. |
mx-psi
removed
release:required-for-ga
Must be resolved before GA release
needs-review-from-stability-wg
labels
Sep 4, 2024
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
I have been looking at what the Collector does in erroneous situations, such as when the exporter's destination is temporarily unavailable and found a number of places in our codebase where we are probably not handling the error cases correctly, which may result in data losses when such data losses are preventable.
This is a summary task where I want to capture a list of things I think can be improved.
Receivers
Some (most?) receivers currently do not follow the Error Handling contract which says:
Here are open issues for some receivers:
If you discover more such receivers please add to the list above.
I would also like to add test helpers that allows us to easily verify the compliance of receivers to the contract. Here is a task to add such helper:
Another part of receiver contracts acknowledgement and checkpointing handling. This issue needs to be resolved to ensure that part of the contract is fulfilled:
Exporters
QueuedRetry exporter helper does not have
requeuingEnabled=true
when in-memory queue is used. This results in data being dropped after a few retries. We don't want this. We want to keep retrying, just like we do when persistent queue is enabled. Perhaps we make requeuingEnabled as user-visible option if we don't want to hard-code the requeuing behavior.Here is the list of tasks to resolve:
Batching
batchprocessor
currently drops the accumulated batch if the next consumer returns non-permanent error. This is not a correct behavior.We have 2 possible solutions:
batchprocessor
stops accepting new data and blocks the input when it is full and retries sending the accumulated batch to the next consumer if previous attempts return non-permanent errors.exporterhelper
#4646.End-to-end Testing
There are many places in our codebase where incorrect implementations and bugs may cause loss of data, especially in stressed and erroneous situations. To ensure the Collector works correctly we want to add end-to-end integration tests that verify the operation of the Collector as a whole:
Without having such test it is hard to be confident that we do not drop the data somewhere due to a bug or due to a component being incorrectly implemented. This should include testing when memory limit is hit.
Here is a task to implement this:
Clarify the Design
Here is the list of issues to resolve:
The text was updated successfully, but these errors were encountered: