Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support getting EventPipe events via ICorProfile callbacks #11750

Closed
1 of 3 tasks
noahfalk opened this issue Jan 4, 2019 · 16 comments · Fixed by #37002
Closed
1 of 3 tasks

Support getting EventPipe events via ICorProfile callbacks #11750

noahfalk opened this issue Jan 4, 2019 · 16 comments · Fixed by #37002

Comments

@noahfalk
Copy link
Member

noahfalk commented Jan 4, 2019

Historically we had an unnecessary divide between ETW events that were available asynchronously out-of-proc and ICorProfile API callbacks that were available synchronously in-proc. Some events were available via one source, some via the other, and some via both. Although it is technically possible to create profilers that register for both streams of events it is needlessly complex and it is impossible to make the async ETW events become synchronous again. To mitigate these issues we want to make all of the ETW events available synchronously via an ICorProfile callback. And because we are cross platform now, ETW events are better abstracted as EventPipe events so that this will work properly on non-Windows platforms too.

We already had a past effort at doing this in PR dotnet/coreclr#19157 and a little bit of discussion of it in #5638. The dotnet/coreclr#19157 attempt handled receiving the EventPipe events fine, but it didn't include a mechanism to enable the events. To use it, a profiler would need to configure the set of desired events via an out-of-band mechanism such as environment variables or managed EventListener APIs. Although this would work for some scenarios, I expected it would be onerous in the common case and it would be awkward to migrate from that form of API to one where ICorProfile also had direct control over the set of events it was receiving.

Work to do:

@noahfalk
Copy link
Member Author

noahfalk commented Jan 4, 2019

@mjsabby
Copy link
Contributor

mjsabby commented May 30, 2019

The first work item in this I believe is complete.

@noahfalk
Copy link
Member Author

yep : )

@msftgits msftgits transferred this issue from dotnet/coreclr Jan 31, 2020
@msftgits msftgits added this to the Future milestone Jan 31, 2020
@davmason
Copy link
Member

@mjsabby (or anyone else interested) I have thought about this recently, and there is the questions of whether the EventPipe callback should be synchronous or asynchronous. The current EventPipe setup is that events are fired asynchronously. They are written to a buffer by the thread that writes the event, and then we have a separate thread that waits for either the buffer to get full or a maximum timeout to be reached, then bulk writes the event to the stream/file/etc.

The path of least resistance when listening via profiler would be to have the events also fired asynchronously to the profiler. However, then you would lose the ability to block on an event to pause the runtime. I.e. you couldn't block on the GC started event or JIT started event to delay the beginning of the GC/jitting.

If the profiler EventPipe events are fired synchronously it would introduce complexity to the profiler. You would have to be concerned with blocking too long, or risk losing events. It also means that it would be fairly trivial to introduce perf regressions to the profiled app unintentionally. Both of these are concerns with the profiler APIs, but since we think carefully about every addition to the profiler APIs it is easy to offer guidance on best practices. Once you have the ability to listen to random events there certainly will be call sequences we haven't reasoned about and have to figure out as we go.

My thinking is that profilers want the ability to block the runtime from completing and are willing to pay the extra complexity to achieve this. I.e., as long as the processing logic isn't extremely cumbersome they would desire to be able to block the runtime strategically.

Let me know about thoughts/concerns.

@noahfalk
Copy link
Member Author

noahfalk commented Feb 1, 2020

or risk losing events

My assumption is that if we are doing synchronous delivery then we wouldn't buffer, and without the buffer we'd never lose events. Blocking would still be a risk though.

@mjsabby
Copy link
Contributor

mjsabby commented Feb 1, 2020

I also prefer synchronous events because the value for us is to react in real time. Further buffering or not blocking is left to the profiler.

Profilers already have to be written in sophisticated ways or the profilee is adversely impacted.

@mjsabby
Copy link
Contributor

mjsabby commented Feb 4, 2020

@davmason @noahfalk this is something we’d really like for 5.0. Given that we have a reasonable amount of time before 5.0 ships can we have this be a feature we’re committing too?

@noahfalk
Copy link
Member Author

noahfalk commented Feb 7, 2020

I'd need to chat with David to see what work remains here and how much he thinks it would cost. I haven't seen him in the office last couple days so just waiting for that.

Can you describe what you'd do with the API if it was available in 5.0?

@mjsabby
Copy link
Contributor

mjsabby commented Feb 7, 2020

If we had this it would unify our eventing story that is fragmented between ETW/EventPipe and Profiler APIs. This would allow us to also react with other APIs in our app that need real time feedback.

@davmason
Copy link
Member

davmason commented Feb 8, 2020

@mjsabby it is on the backlog for 5.0 but not currently being worked on. My intent is to have it in for 5.0, but I don't think I can promise that it will for sure be there. There's always the risk of me being pulled to other high priority projects.

@mjsabby
Copy link
Contributor

mjsabby commented Feb 20, 2020

@davmason is it possible to do this in steps? Maybe we can add the ability to also write to the profiler api and the control side of things can be added later? Like I did in dotnet/coreclr#19157

@noahfalk your thoughts?

I'm just trying to see how it can land in 5.0 because it missed, 2.2, 3.0 and 3.1 .. and if we miss 5.0 it'll be yet another year.

@mjsabby
Copy link
Contributor

mjsabby commented Feb 20, 2020

I also meant to add given that we a diagnostic control library that maybe adding an api to enable the specific events using profiler api might not be as pressing it was previously, so that requirement as a whole could go away.

In any case, I'll revive my PR on this new repo and see where it goes.

@davmason
Copy link
Member

@mjsabby since the change to EventPipe where each listener has its own session, it wouldn't really make sense to enable listening without being able to specify what events you want to receive. I don't think it makes sense to use the diagnostics library for that. The API to specify what events should be enabled is one of the more straightforward parts of this work.

There are two parts I'm concerned about taking longer. The first is that synchronously delivering events means we have a new, profiler only, path in EventPipe that might be a constant source of bugs.

The second is that the EventPipe events are delivered encoded, so the profiler would need a way to decode the payload. I believe all of our decoding logic is in managed code, so we would have to either provide a decoding API or a sample profiler that does the decoding that other profiler implementers could use for guidance.

I think the runtime work could come as a first part, then the decoding could happen later if we had to do it in parts.

@noahfalk
Copy link
Member Author

Apologies, I meant to respond earlier and got distracted.

it wouldn't really make sense to enable listening without being able to specify what events you want to receive

+1

The first is that synchronously delivering events means we have a new, profiler only, path in EventPipe that might be a constant source of bugs

FWIW I think it would be worthwhile long-term to have the option of creating an EventPipe session that delivers its events synchronously. The profiler might be the first user of such functionality but I could imagine out-of-proc tools wanting to do it too.

so we would have to either provide a decoding API or a sample profiler that does the decoding that other profiler implementers could use for guidance.

I think it is legitimate for the runtime to provide an API that spits out bytes in an encoded format as long as we have a specification for the format (such as the already specified nettrace format). I agree having a sample or a decoding API would make it easier and potentially more efficient, but if Mukul was motivated to write a separate decoder then the overall solution would still satisfy the goal.

In any case, I'll revive my PR on this new repo and see where it goes

Did we ever get regression tests completed for the last set of profiler features you added Mukul? : ) If not I think we should ensure that the previous features have tests before we taking on new work. I know its not sexy, but its the only way to keep ourselves honest and test debt in check.

If we had this it would unify our eventing story that is fragmented between ETW/EventPipe and Profiler APIs. This would allow us to also react with other APIs in our app that need real time feedback.

Is there a story that we could attach to this that frames it as business impact rather than implementation impact? Right now I'd be telling managers the benefit is "Bing gets to make improvements to their profiler codebase and they have the opportunity for unspecified new profiler functionality in the future". A more compelling message might be "Bing will implement new profiler features X, Y, Z on top of this and that lets them half their MTTR on a hard set of production issues" or "This feature lets Bing do improved performance evaluation technique X that is required to identify their next set of potential latency improvements".

I'm just trying to see how it can land in 5.0 because it missed, 2.2, 3.0 and 3.1 .. and if we miss 5.0 it'll be yet another year.

Yeah I feel guilty suggesting it years ago and then having it keep lingering on without getting completed : ( In earlier releases we discovered we were missing some of the foundational stuff like multiple session support and that got addressed in 3.0 so progress was made. Now I think it is just a matter of priority to keep it moving forward. Either we find a good reason that the work should be high on the runtime list for 5.0 or we work with you to get it as another contribution.

@mjsabby
Copy link
Contributor

mjsabby commented Feb 21, 2020

I think it's fair to require me to do the test work for the previous profiler features before taking up this new feature :)

I will formulate how exactly it'll help us beyond improvements to the codebase.

Hopefully I get to do it well before 5.0 stops taking checkins!

@noahfalk
Copy link
Member Author

Hopefully I get to do it well before 5.0 stops taking checkins!

Last I heard we should be free to add things through May and then the bar might start rising in June. In 3.0 the exact dates/bar kept being adjusted so I can't give a hard date, but this should be the right ballpark.

@ghost ghost locked as resolved and limited conversation to collaborators Dec 14, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants