fix(subscriber): mitigate race in Callsites::contains #474

Closed
hds wants to merge 5 commits

Collaborator

@hds hds commented Oct 11, 2023

The `ConsoleLayer` uses the `Callsites` struct to store and check for
callsites for specific kinds of traces, for example spawn spans or waker
events.

`Callsites` stores a fixed-size array of pointers to the `Metadata` for
each callsite and a length to indicate how many callsites it has
registered. The length and each individual pointer are stored in
atomics.
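
As a rough illustration of the layout described above, here is a minimal sketch. `Meta` is a hypothetical stand-in for `tracing`'s `Metadata<'static>`, and the `len` and `slot` accessors exist only for this example. The ordering inside `insert` (length bumped first, pointer written second) is an assumption that matches the race discussed in this PR; it is not copied from the `console-subscriber` source.

```rust
use std::ptr;
use std::sync::atomic::{AtomicPtr, AtomicUsize, Ordering};

/// Hypothetical stand-in for `tracing::Metadata<'static>`.
pub struct Meta {
    pub name: &'static str,
}

/// Fixed-size array of callsite pointers plus a count of registered entries.
pub struct Callsites<const MAX: usize> {
    ptrs: [AtomicPtr<Meta>; MAX],
    len: AtomicUsize,
}

impl<const MAX: usize> Callsites<MAX> {
    pub fn new() -> Self {
        Self {
            // Every slot starts out null, meaning "not written yet".
            ptrs: std::array::from_fn(|_| AtomicPtr::new(ptr::null_mut())),
            len: AtomicUsize::new(0),
        }
    }

    /// Registers a callsite. Assumed ordering: the length is bumped first and
    /// the pointer is written second, as two separate atomic operations.
    pub fn insert(&self, callsite: &'static Meta) {
        let idx = self.len.fetch_add(1, Ordering::AcqRel);
        assert!(idx < MAX, "callsite array is full");
        // Between the `fetch_add` above and this store, a concurrent reader
        // can observe the new length while the slot is still null.
        self.ptrs[idx].store(callsite as *const Meta as *mut Meta, Ordering::Release);
    }

    /// Number of slots that have been claimed (example-only accessor).
    pub fn len(&self) -> usize {
        self.len.load(Ordering::Acquire)
    }

    /// Raw pointer currently stored in slot `idx` (example-only accessor).
    pub fn slot(&self, idx: usize) -> *const Meta {
        self.ptrs[idx].load(Ordering::Acquire)
    }
}
```

Because the length and the pointer are independent atomics, there is no single point at which both become visible together, which is what the rest of this description is concerned with.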

Since these values can change independently of each other, if a
callsite lookup fails, we check whether the length of the array changed
while we were checking the pointers. If it did, the lookup is started
again.

However, there is still a possible race condition. If the length
changes, but the lookup occurs before the callsite pointer is actually
written, then we may miss a callsite that is in the process of being
registered. In this case, the pointer which is loaded from the
`Callsites` array will be null.

This change adds a check for this case (null pointer) and repeats the
lookup if it occurs.
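
To make the two checks concrete, here is a hedged sketch of a lookup that retries both when the length grew and when a null slot was read. `Meta` is a hypothetical stand-in for `Metadata<'static>`, and splitting registration into `reserve` and `publish` is an assumption made only so the half-registered state can be reproduced; the retry here is unbounded, which the actual change may not be.

```rust
use std::hint;
use std::ptr;
use std::sync::atomic::{AtomicPtr, AtomicUsize, Ordering};

/// Hypothetical stand-in for `tracing::Metadata<'static>`.
pub struct Meta {
    pub name: &'static str,
}

pub struct Callsites<const MAX: usize> {
    ptrs: [AtomicPtr<Meta>; MAX],
    len: AtomicUsize,
}

impl<const MAX: usize> Callsites<MAX> {
    pub fn new() -> Self {
        Self {
            ptrs: std::array::from_fn(|_| AtomicPtr::new(ptr::null_mut())),
            len: AtomicUsize::new(0),
        }
    }

    /// Step 1 of registration (hypothetical split): claim a slot by bumping `len`.
    pub fn reserve(&self) -> usize {
        let idx = self.len.fetch_add(1, Ordering::AcqRel);
        assert!(idx < MAX, "callsite array is full");
        idx
    }

    /// Step 2 of registration (hypothetical split): write the pointer.
    pub fn publish(&self, idx: usize, callsite: &'static Meta) {
        self.ptrs[idx].store(callsite as *const Meta as *mut Meta, Ordering::Release);
    }

    pub fn insert(&self, callsite: &'static Meta) {
        let idx = self.reserve();
        self.publish(idx, callsite);
    }

    pub fn contains(&self, callsite: &'static Meta) -> bool {
        let mut start = 0;
        let mut len = self.len.load(Ordering::Acquire);
        loop {
            let mut saw_null = false;
            for slot in &self.ptrs[start..len] {
                let recorded: *const Meta = slot.load(Ordering::Acquire);
                if ptr::eq(recorded, callsite as *const Meta) {
                    return true;
                } else if recorded.is_null() {
                    // The length was bumped, but this pointer is not written yet.
                    saw_null = true;
                }
            }
            if saw_null {
                // Null-pointer check: rescan the same range.
                hint::spin_loop();
                continue;
            }
            let new_len = self.len.load(Ordering::Acquire);
            if new_len > len {
                // Length check: scan only the slots added since the last load.
                start = len;
                len = new_len;
                continue;
            }
            return false;
        }
    }
}
```

In this sketch a reader that starts between `reserve` and `publish` keeps rescanning until the pointer appears, instead of reporting that the callsite is absent.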

This race condition was found while chasing down the source of #473. This
change doesn't solve the flakiness, but it can reduce the likelihood of it
occurring, so it is a mitigation only.

In reality, neither of these race condition checks should be needed, as
we would expect that `tracing` guarantees that `ConsoleLayer` completes
`register_callsite()` before `on_event()` or `new_span()` are called.
@hds hds requested a review from a team as a code owner October 11, 2023 10:46
return true;
} else if ptr::eq(recorded, ptr::null_mut()) {
Member

style nit, take it or leave it: this could be

Suggested change
- } else if ptr::eq(recorded, ptr::null_mut()) {
+ } else if recorded.is_null() {

} else if ptr::eq(recorded, ptr::null_mut()) {
// We have read a recorded callsite before it has been
// written. We need to check again.
continue;
Member

hmm, this restarts the whole loop over again. should we, instead, have an inner loop for loading the specific array index until it's no longer null?
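
For reference, the inner-loop alternative suggested here could look roughly like the sketch below. `Meta` is a hypothetical stand-in for `Metadata<'static>`, and `reserve`, `publish`, and `load_published` are names invented for this example. It assumes every index below `len` will eventually be written; if a writer panicked between the two steps, the spin would never finish.

```rust
use std::hint;
use std::ptr;
use std::sync::atomic::{AtomicPtr, AtomicUsize, Ordering};

/// Hypothetical stand-in for `tracing::Metadata<'static>`.
pub struct Meta {
    pub name: &'static str,
}

pub struct Callsites<const MAX: usize> {
    ptrs: [AtomicPtr<Meta>; MAX],
    len: AtomicUsize,
}

impl<const MAX: usize> Callsites<MAX> {
    pub fn new() -> Self {
        Self {
            ptrs: std::array::from_fn(|_| AtomicPtr::new(ptr::null_mut())),
            len: AtomicUsize::new(0),
        }
    }

    /// Claims a slot by bumping `len` (hypothetical first half of registration).
    pub fn reserve(&self) -> usize {
        let idx = self.len.fetch_add(1, Ordering::AcqRel);
        assert!(idx < MAX, "callsite array is full");
        idx
    }

    /// Writes the pointer into a claimed slot (hypothetical second half).
    pub fn publish(&self, idx: usize, callsite: &'static Meta) {
        self.ptrs[idx].store(callsite as *const Meta as *mut Meta, Ordering::Release);
    }

    /// Spins on a single slot until its pointer has been written.
    fn load_published(&self, idx: usize) -> *const Meta {
        loop {
            let recorded: *const Meta = self.ptrs[idx].load(Ordering::Acquire);
            if !recorded.is_null() {
                return recorded;
            }
            hint::spin_loop();
        }
    }

    pub fn contains(&self, callsite: &'static Meta) -> bool {
        let mut start = 0;
        let mut len = self.len.load(Ordering::Acquire);
        loop {
            for idx in start..len {
                // Only this index is retried; earlier slots are not rescanned.
                if ptr::eq(self.load_published(idx), callsite as *const Meta) {
                    return true;
                }
            }
            let new_len = self.len.load(Ordering::Acquire);
            if new_len > len {
                start = len;
                len = new_len;
                continue;
            }
            return false;
        }
    }
}
```

Compared with restarting the whole scan, this waits only on the slot that is still null, at the cost of an unbounded spin on that slot.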

Collaborator Author

@hawkw After some thinking (and then forgetting about this), I believe that this change doesn't make sense.

In fact, I'm not sure that the retry mechanism makes sense in general. If `tracing` gives the guarantee that `Subscriber::register_callsite` will be called (and finish!) before `Subscriber::event` gets called, then the retry shouldn't be necessary at all.

If it doesn't give that guarantee, then the retry is increasing our chances of finding the callsite we are interested in during a race condition, but I don't think that it actually solves the problem either.

For now I'm going to close this PR. After looking at tokio-rs/tracing#2743 a bit more I'll revisit.

Member

hawkw commented Oct 25, 2023

it would be really nice to have a test that reproduces this raciness. maybe we should add loom tests for Callsites to help catch this kind of issue...

Collaborator Author

hds commented Nov 16, 2023

Closing this without merging as I don't think it actually makes sense. See #474 (comment) for more details.

@hds hds closed this Nov 16, 2023