Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make EventPipeProviderCallbackData own the filter data #42307

Merged
merged 3 commits into from
Sep 17, 2020

Conversation

davmason
Copy link
Member

Fixes #39669

There is a use after free bug in EventPipe shutdown. The important steps in shutdown are:

  • In EventPipe::DisableInternal we call s_config.Disable(*pSession, pEventPipeProviderCallbackDataQueue) to generate a list of EventPipeProviderCallbackData to make callbacks to the providers after the session is closed. This copies some data from the EventPipeSessionProvider like keywords, level, filter data, etc.
  • We then call pSession->Disable() which tears the session down and, importantly, deletes the list of EventPipeSessionProvider for the session
  • At this point the pFilterData field on EventPipeProviderCallbackData is pointing to freed memory since the EventPipeSessionProvider is what used to own the memory. It works for all the other fields because they are simple types, but the filter data is a dynamically allocated string.

This would cause a crash every once in a while when the memory would be overwritten and we would try to convert garbage in to a UTF-8 string to parse it.

This change fixes the issue by making EventPipeProviderCallbackData own the filter data string (it is copied from the session provider). I made sure to make the copy constructors as deleted so it can't be copied more then the initial time. The only other alternatives I see would be to mess with the order of shutdown and try to keep the session partially alive until the provider callback can happen, or to change the ownership of the filter data string to some other data structure.

The perf impact of one extra copy should be negligible, and it makes the code much simpler than the other alternatives.

@davmason davmason added this to the 6.0.0 milestone Sep 16, 2020
@davmason davmason requested review from hoyosjs, josalem and a team September 16, 2020 10:00
@davmason davmason self-assigned this Sep 16, 2020
@davmason
Copy link
Member Author

@tommcdon should I attempt to port this to 5.0 at this point? The hit rate for the failed test is extremely low, it would only fail once every week or two.

@davmason davmason force-pushed the filterdata_cache_issue branch from b03d466 to e9985fa Compare September 16, 2020 10:06
Copy link
Contributor

@sdmaclea sdmaclea left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason we aren't using smart pointer? Using bare pointers leads to unclear ownership semantics and is currently not considered best practice.

@davmason
Copy link
Member Author

Is there a reason we aren't using smart pointer? Using bare pointers leads to unclear ownership semantics and is currently not considered best practice.

I don't have a great answer other than we aren't currently using them. We have a couple things like NewArrayHolder and ReleaseHolder, but they are not really the same thing as a shared_ptr. They don't do ref counting and don't support copying, only construction or moving. They are typically used either in a method body or as a field on a class that never gets copied. In EventPipe we don't use them for fields, they are currently all manually managed.

My assumption (for what it's worth) is that this is intentional, in the runtime we prefer to have clear ownership over memory lifetime. I.e. a piece of memory is owned by one object and when that object goes away so does all the memory it uses, and a shared_ptr style smart pointer would break that.

The thing that bothers me about this particular problem is that the session provider conceptually owns the filter data string, and having to copy it is a sign that the session provider is being freed before it's really out of scope. The original code where it takes a raw pointer to the string is the same coding pattern used predominantly in the runtime, it's just that the session providers are freed too early. On the other side of the argument is the shutdown path is already complicated and hard to reason about, and adding another place where we partially defer work is only making that worse.

@tommcdon
Copy link
Member

@tommcdon should I attempt to port this to 5.0 at this point? The hit rate for the failed test is extremely low, it would only fail once every week or two.

@davmason does this issue repro when starting /stopping an event pipe session, or only at CLR shutdown? If the latter I suggest that we fix for 6.0, otherwise we should consider for 5.0.

Copy link
Contributor

@josalem josalem left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The thing that bothers me about this particular problem is that the session provider conceptually owns the filter data string, and having to copy it is a sign that the session provider is being freed before it's really out of scope.

Agreed

On the other side of the argument is the shutdown path is already complicated and hard to reason about, and adding another place where we partially defer work is only making that worse.

Also agreed.

The perf impact of one extra copy should be negligible, and it makes the code much simpler than the other alternatives.

I imagine this will have effectively no perf impact in the grand scheme of things. Session shutdown is not a high frequency code path and the time it takes to disable a session is dominated by rundown and other work.

This patch looks good to me. I'm surprised this failure doesn't hit more often, especially in CI. There is a clear ownership/lifetime issue though, so we should try to address that as well.

I agree with @tommcdon for backporting. We should at least see if meets bar. If the patch is too large, we could scale back to just doing the wcspy_s call with the existing code without too much effort.

@sywhang
Copy link
Contributor

sywhang commented Sep 16, 2020

Thanks for fixing this @davmason!

does this issue repro when starting /stopping an event pipe session, or only at CLR shutdown?

This will happen when stopping an EventPipe session.

IMO, the fix is simple enough to be considered for 5.0 fix.

@@ -205,12 +205,12 @@ void EventPipeProvider::AddEvent(EventPipeEvent &event)
}
Copy link
Member

@jorive jorive Sep 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: maybe add a precondition that the input pointer parameter is not null.

@davmason davmason merged commit 873d963 into dotnet:master Sep 17, 2020
@davmason
Copy link
Member Author

/backport to release/5.0-rc2

@github-actions
Copy link
Contributor

Started backporting to release/5.0-rc2: https://github.com/dotnet/runtime/actions/runs/258967226

@josalem
Copy link
Contributor

josalem commented Sep 18, 2020

CC @lateralusX

@ghost ghost locked as resolved and limited conversation to collaborators Dec 7, 2020
@davmason davmason deleted the filterdata_cache_issue branch January 20, 2021 08:58
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Test failure: profiler/eventpipe/eventpipe/eventpipe.sh
7 participants