Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Resolves #5081
TL;DR:
There are flaws in the test.
EVENTS_MAP
. It should always be span ID.EXPECTED_TRACE_EVENTS_COUNT
constant says.This PR addresses all of these.
Details:
EVENTS_MAP
is accessed using inconsistent keys.The map holds spans. In some places the test code uses the trace ID as the key to the map (where the tracer adds entries, where the
e2e
test method removes the span it started explicitly) but in other places it uses the span ID (assertSpanChain
,findAndRemoveSpan
). These IDs are not interchangeable.The countdown latch is not initialized correctly.
In my local runs, 5 spans are reported to the two tracers--3 to the client tracer and 2 to the server tracer. The test uses a countdown latch to coordinate between the span reporter code in the tracers (which counts down the latch each time a span is reported) and the main test. But the latch is initialized to 4, not 5.
The ancestry check should start with the parent of the explicitly-created span (local variable
start
in the test), not that span itself.Some temporary debug output I added after each
put
to the map shows the tracer service name, the remaining count down latch value, the trace ID, the span ID, the parent span ID, and the span name for each of the 5 spans reported:Item 1 means that the map contains at most one entry because the trace ID is the same for all reported spans. After the spans have been reported to the tracers, the main test code removes that single entry. The later
assertSpanChain
can almost never fail because it almost never has any spans to look at. The "almost" is related to item 2.item 2 means that the main test thread is allowed to proceed any time after the latch count goes to 0, potentially before the last span has been reported and added to the map. I believe this is the race condition that triggers both types of intermittent failures described in the issue.
During most test runs, the fifth span is reported and added to the map before the main test thread reaches the code after the
EVENTS_LATCH.await
which uses theEVENTS_MAP
. Then the main test thread removes the single entry from the map, there are no entries to check the descendants of, and the test passes.But sometimes the main test code removes the one map entry, then the tracer records the fifth span and adds it to the map.
assertSpanChain
then the ancestry check fails. This is the failure Joe and I saw.assertSpanChain
has finished, then the remaining assertion in the test which checks for an empty map fails. This is the failure Daniel saw.As for item 3...
I changed the test code to resolve items 1 and 2. Here is more debug output from the rest of the test that checks descendants after I made those changes:
After the descendants check removes span 81f1c94dbeaa4c13 and recurses, the map is still not empty because of the first span reported to the tracers, the one with the null name that's the parent of the explicitly-started span. So the descendants check tries to find the parent which is not in the map, and the test fails.
Because the parent span of the one the test explicitly creates (the
start
variable) is reported to the tracer and is therefore added to the map, the main test should start the descendants check with the parent ofstart
instead of withstart
itself.