Implement A Thread Time View for Universal Traces#2320
Implement A Thread Time View for Universal Traces#2320brianrob merged 6 commits intomicrosoft:mainfrom
Conversation
These will be required later when implementing offline symbol resolution for ELF images.
Starting with the V6 file format, the processor number is always set. This enables consumers to build tools based on processor usage. For example, this enables tools such as ThreadTime and Processor stacks to be implemented for universal traces.
marklio
left a comment
There was a problem hiding this comment.
In general, this looks reasonable. I have some question/suggestions (some of which) I think merit discussion, so I'm gonna mark this "comment" instead of approve for now
| // The value is the amount of time that thread was switched out. | ||
|
|
||
| TraceThread thread = data.Thread(); | ||
| if (thread == null) |
There was a problem hiding this comment.
Is this hardening against a plausible scenario? Or is this representative of a "do not crash" principle in this codebase? That is, would a null thread here be indicative of a bug elsewhere that we are making hard to diagnose by simply writing to Debug?
There was a problem hiding this comment.
For universal traces this would be a bug. We create all of the threads that we know about when opening the trace for the first time. It's important to check for null here since we don't want to break the tool if we introduce a bug elsewhere. That said, we could potentially do something better here. From a layering perspective, there's not a good log to write to, but that's something that we could fix going forward. Thoughts?
There was a problem hiding this comment.
This is a hard call when a tool is so widely used on a diversity of inputs we can't hope to reasonably cover in validation. Do we use Watson to look at crashes? If this were my tool, I'd want to be set up to get the very best diagnostic artifacts. What is the experience of this "log and ignore" approach? Is it actually less annoying than an actionable crash for the type of user that runs the tool? Would be sweet to have the UI be isolated from this processing in a way that allowed actionable diagnostics to flow and provide a good UI experience, but that's a lot of refactoring.
There was a problem hiding this comment.
On Watson - I don't usually look at it too much, but on occasion yes. I would say that yes, the current behavior is less annoying than the crash. I usually see this fall into one of two buckets:
- There is a single or very small number of events that get ignored because the thread for whatever reason didn't get created.
- We're missing something important in the trace and when you open the view it's effectively empty.
In the first case, the user's analysis is almost always unaffected. In the second case, it's very obvious something went wrong and the user will re-collect and/or report the issue.
I do agree with you - there is a pattern missing here around how we egress diagnostics. Some of the very core pieces of TraceEvent have logging capabilities that PerfView just has to turn on. That pattern likely just needs to be extended here (and frankly to many other places).
| // If the thread wasn't previously marked as blocked, then mark it. | ||
| if (!m_threadState[(int)thread.ThreadIndex].ThreadBlocked) | ||
| { | ||
| double startTimeStampRelMsec = m_eventLog.QPCTimeToRelMSec(data.TimeStampQPC - (long)data.Value); |
There was a problem hiding this comment.
Below, you use data.TimeStampRelativeMSec. This is calculation different? The answer to this may answer my question above
There was a problem hiding this comment.
TraceEvent does most of its calculations in milliseconds, though the data type is a double so we get fractional milliseconds. The reason this one is in QPC is because the entity that writes the event is writing it in QPC and not milliseconds.
There was a problem hiding this comment.
I'm sure I'm missing something. On line 731 (with the same data), it uses .TimeStampRelativeMSec. And further below on 754, which seems like the same "mark as blocked" scenario, it also uses .TimeStampRelativeMSec. What is different about THIS callsite that it does the calculation from QPC?
There was a problem hiding this comment.
I don't think this should block this particular improvement. I'm more just curious in what our story is for reacting to customer pain with this tool.
|
Thank you for the review @marklio. Answered your questions and pushed some changes. |
Traces generated by https://github.com/microsoft/one-collect can provide a comprehensive view of system behavior through CPU and context switch events.
Modify the
ThreadTimeStackComputerto support aggregation of CPU and blocked time from theUniversal.Eventsprovider defined in one-collect.Expose the new view for nettrace files that contain these events.