fix(subscriber): mitigate race in `Callsites::contains` #474

hds · 2023-10-11T10:46:53Z

The ConsoleLayer uses the Callsites struct to store and check for
callsites for specific kinds of traces, for example spawn spans or waker
events.

Callsites stores a fixed size array of pointers to the Metadata for
each callsite and a length to indicate how many callsites it has
registered. The length and each individual pointer are stored in
atomics.

Since these values can change independently, a failed callsite lookup
checks whether the length of the array changed while the pointers were
being checked. If it did, the lookup is started again.

However, there is still a possible race condition. If the length
changes, but the lookup occurs before the callsite pointer is actually
written, then we may miss a callsite that is in the process of being
registered. In this case, the pointer which is loaded from the
Callsites array will be null.

This change adds a check for this case (a null pointer) and repeats the
lookup if it occurs.
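
To make the two retry conditions concrete, here is a minimal sketch of the
layout described above. It is not the code in callsites.rs: `Meta` is a
hypothetical stand-in for the tracing `Metadata` type, the capacity `N` and
the memory orderings are assumptions, and `insert` is simplified to a plain
store.

```rust
use std::hint;
use std::ptr;
use std::sync::atomic::{AtomicPtr, AtomicUsize, Ordering};

/// Hypothetical stand-in for `tracing`'s `Metadata` (assumption).
pub struct Meta {
    pub name: &'static str,
}

/// Fixed-size array of callsite pointers plus a length, all stored in atomics.
pub struct Callsites<const N: usize> {
    ptrs: [AtomicPtr<Meta>; N],
    len: AtomicUsize,
}

impl<const N: usize> Callsites<N> {
    pub fn new() -> Self {
        Self {
            ptrs: std::array::from_fn(|_| AtomicPtr::new(ptr::null_mut())),
            len: AtomicUsize::new(0),
        }
    }

    /// Registers a callsite. The length is bumped first and the pointer is
    /// written second; the gap between the two steps is the race window.
    pub fn insert(&self, callsite: &'static Meta) {
        if self.contains(callsite) {
            return;
        }
        let idx = self.len.fetch_add(1, Ordering::AcqRel);
        assert!(idx < N, "too many callsites registered");
        self.ptrs[idx].store(callsite as *const Meta as *mut Meta, Ordering::Release);
    }

    pub fn contains(&self, callsite: &'static Meta) -> bool {
        let mut start = 0;
        let mut len = self.len.load(Ordering::Acquire).min(N);
        'retry: loop {
            for slot in &self.ptrs[start..len] {
                let recorded = slot.load(Ordering::Acquire);
                if ptr::eq(recorded, callsite) {
                    return true;
                } else if recorded.is_null() {
                    // The length was bumped but the pointer has not been
                    // written yet: scan this range again.
                    hint::spin_loop();
                    continue 'retry;
                }
            }
            // Existing check: did the length change during the scan?
            let new_len = self.len.load(Ordering::Acquire).min(N);
            if new_len > len {
                start = len;
                len = new_len;
                continue;
            }
            return false;
        }
    }
}
```

In this sketch a lookup that races with `insert` spins until the pointer is
published, so it only terminates if the writer eventually completes its
store.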

This race condition was found while chasing down the source of #473. It
doesn't solve the flakiness, but it can reduce the likelihood of it
occurring, so it is a mitigation only.

In reality, neither of these race condition checks should be needed, as
we would expect that tracing guarantees that ConsoleLayer completes
register_callsite() before on_event() or new_span() are called.


hawkw · 2023-10-25T20:32:55Z

console-subscriber/src/callsites.rs

                    return true;
+                } else if ptr::eq(recorded, ptr::null_mut()) {


style nit, take it or leave it: this could be

Suggested change

-                } else if ptr::eq(recorded, ptr::null_mut()) {
+                } else if recorded.is_null() {

hawkw · 2023-10-25T20:33:53Z

console-subscriber/src/callsites.rs

+                } else if ptr::eq(recorded, ptr::null_mut()) {
+                    // We have read a recorded callsite before it has been
+                    // written. We need to check again.
+                    continue;


hmm, this restarts the whole loop over again. should we, instead, have an inner loop for loading the specific array index until it's no longer null?
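
For illustration, the inner-loop idea could look roughly like the sketch below. The helper names `load_published` and `scan_slots` are hypothetical and not part of console-subscriber, and the sketch assumes that every slot below the recorded length is eventually written.

```rust
use std::hint;
use std::ptr;
use std::sync::atomic::{AtomicPtr, Ordering};

/// Hypothetical helper: spin on a single slot until its pointer is non-null.
/// Assumes the writer always completes its store; otherwise this never returns.
fn load_published<T>(slot: &AtomicPtr<T>) -> *mut T {
    loop {
        let recorded = slot.load(Ordering::Acquire);
        if !recorded.is_null() {
            return recorded;
        }
        hint::spin_loop();
    }
}

/// Hypothetical helper: scan the reserved slots, waiting on each index
/// individually instead of restarting the whole scan on a null pointer.
fn scan_slots<T>(slots: &[AtomicPtr<T>], target: &T) -> bool {
    slots
        .iter()
        .any(|slot| ptr::eq(load_published(slot) as *const T, target))
}
```

Compared with restarting the outer loop, this only re-reads the one index that was observed as null, at the cost of blocking the lookup on that slot.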

@hawkw After some thinking (and then forgetting about this), I believe that this change doesn't make sense.

In fact, I'm not sure that the retry mechanism makes sense in general. If tracing gives the guarantee that Subscriber::register_callsite will be called (and finish!) before Subscriber::event gets called, then the retry shouldn't be necessary at all.

If it doesn't give that guarantee, then the retry is increasing our chances of finding the callsite we are interested in during a race condition, but I don't think that it actually solves the problem either.

For now I'm going to close this PR. After looking at tokio-rs/tracing#2743 a bit more I'll revisit.

hawkw · 2023-10-25T20:35:40Z

it would be really nice to have a test that reproduces this raciness. maybe we should add loom tests for Callsites to help catch this kind of issue...

hds · 2023-11-16T11:59:05Z

Closing this without merging as I don't think it actually makes sense. See #474 (comment) for more details.

hds requested a review from a team as a code owner October 11, 2023 10:46

hds added 4 commits October 19, 2023 15:57

Merge branch 'main' into hds/callsites-race

988daeb

Merge branch 'main' into hds/callsites-race

e3c1b56

Merge branch 'main' into hds/callsites-race

8111c70

Merge branch 'main' into hds/callsites-race

5ba6eaf

hawkw reviewed Oct 25, 2023


hds closed this Nov 16, 2023
