-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BUG] SpanData validation failed for validator org.opensearch.test.telemetry.tracing.validators.AllSpansAreEndedProperly flakyness #11310
Comments
@Gaganjuneja FYI |
I will take a look. Thanks! |
It does not look like the test in itself is the issue. In my experiments, I can observe that the moment I turn of the update mappings from the test (by keeping the fields constant in the source), the tests starts to pass consistently. This seems to be a very corner case, where we are unable to close the span in cases of mapping update during node restarts. The span stays in the memory without any execution thread associated to it. There would be two options that we can follow from here on:
|
We can do that but it's very hard to make sure that its being used cautiously and not being exploited to pass the test reporting the underneath memory leak issues. @reta your thoughts here? |
Thanks @sarthakaggarwal97
This is concerning behaviour: to me, such edge cases lead to the failure flows that are not instrumented. It would be great to understand why the restart (which basically should be clean stop / start sequence) keeps some components in "running" state (not aware of restart). We should not be looking for shortcuts here. |
Let me elaborate the issue with two cases of spans, one that was successful, and the one that failed. In both of these cases, the mappings update was required. Would like to get your inputs on how we can proceed on this. Happy CaseLet's consider upon active indexing, a new span
It realises that we need to update the mappings
Reaches the onResponse of Mapping Updates
Post mappings update, executor is attached, and the executeBulkItemRequest is continued
Sad CaseFollowing the span based on the exception
Upon active indexing, a new span
It realizes that we need to update the mappings
Before the span can come into the Mappings Update Response, the node is restarted
Then this span never comes into the Mappings Update Response. However, the parent span ends post restart
Attaching the logs: op.log |
With this, I've a couple of questions as well:
|
Yes, parent span can close before the children in the async scenarios.
Span stays in the memory till we call the endSpan. Definitely, in case of node stop (opensearch process stop) spans will be cleared. But the ideal state is where on the graceful shutdown we should close all the open spans. This is also a validation that we know how to react to the node stop/start from the outstanding indexing requests and we should have all the open requests/tasks/actions closed gracefully and release all the resources. |
Ack!
Correct, but here's an interesting case. The mapping update request R1 comes to cluster-manager node T1, and the node restarts. With this, or in cases where the node T1 fails to join the cluster, the data node will retry the request to the new cluster-manager node T2. Unless there is a way to gracefully drain the spans in case of node restarts in the tests, i.e remove from the |
@sarthakaggarwal97 thanks for the detailed flow
Aha, so basically what we have here:
I will try to play with your reproducer to understand where we have a gap, thank you. |
Describe the bug
It seems like we have some failure scenarios that are not covered by tracing instrumentation framework, randomnly failing the validation checks:
To Reproduce
Expected behavior
The checks must always pass
Plugins
Standard
Screenshots
If applicable, add screenshots to help explain your problem.
Host/Environment (please complete the following information):
Additional context
The text was updated successfully, but these errors were encountered: