Context-Centric Profile Views #183

blue42u · 2022-03-05T23:10:21Z

blue42u
Mar 5, 2022

(Starting this off as a "discussion" since it's larger than my other suggestions, unfortunately this means it can't be added to milestones or projects since GitHub considers it different from "issues." Sigh.)

RFC @laksono @jmellorcrummey @martyitz

Background
The profile views are suited for inspecting the performance of the application as a whole, the top-down view finds slow code starting at the program and thread roots, and the bottom-up view locates causes of low-level hotspots. These "performance extrema" work well for identifying first-order performance issues (where there are few calling contexts between extrema), but for complex code that has already been refined the remaining performance issues tend to be second-order: while there are definitely functions that could be improved, costly calls are sprinkled throughout the code and no one line within the function (and its callees) is a hotspot.

To give a more concrete example, I recently inspected the performance of malloc in Dyninst/hpcstruct as a potential second-order bottleneck, some images of the bottom-up view are shown below. There happens to be a low-level hotspot within malloc at __GI___mprotect, indicating that the performance problem has to do with the amount of memory allocated. This is a misleading assessment however, this hotspot only accounts for half the total performance impact of malloc. Unfortunately there is no good way to identify what in malloc is causing the issue, just that there's an issue.

To assist with complex cases like this, I'd like to propose a redesign of the profile views to be "context-centric," that is there is a "center" context where all performance is presented relative to and which can move like a cursor through the call graph. My mental image is similar to the call graph view from KCacheGrind, IIRC this also has some similarity to PA's profile interface.

Below is my rough thoughts on what needs to change, apologies in advance for the brain dump.

Top-Down View
The top-down view should present the merged sub-trees rooted at the "center" context. By default the center is the invisible root of the CCT, so at startup it should present what it does currently:

Experiment Aggregate
`- main
   `- real_main
   |  `- foo
   |     `- bar
   `- foo
      `- baz

But if the center is changed to foo, the sub-trees should be merged to produce the performance within foo itself:

foo
|- bar
`- baz

Note that there is no "Aggregate" row in this latter case, the center is always the topmost row of the view. This ensures the "state" is fully visible at all times.

Bottom-Up View
The bottom-up view should present callers as it does now, except that it should only ever present callers of the center. For instance, with a center of foo only the tree starting at foo should be presented:

foo
|- real_main
|  `- main
`- main

In the default case (where the center is the invisible CCT root), the bottom-up view should instead present nothing (outside of an informational "Select a context" message). The rationale for this is to encourage using the flat view for locating hotspots, since the flat view is able to identify hotspots regardless of level. (Also to remove redundancy, and because the flat view is more useful but IMHO often underutilized.)

Note that again, the center is always the topmost row of the view. This again ensures the "state" is visible at all times.

Flat View and Source Pane
The flat view is good as-is, however in addition to presenting the flat contexts it should now additionally highlight the center context. This allows the flat view to serve as another visible "state" of the call graph traversal, unlike the other views however the flat view additionally provides surrounding context of the application (regardless of call structure).

To better serve that purpose, the flat view should live in the upper pane away from the top-down and bottom-up views, allowing both the flat and call-based views to be visible at once. For sanity hyperlinks in the flat view should require additional clicks before opening the associated source view.

Similarly, the source panes should also highlight the center context if visible, this allows them to serve as a variant of the flat view that presents source code. This also integrates well with #147, if used a gutter added for that can additionally link back to the associated line context in the flat view.

Navigation
The reason for keeping the "center" global across all views is to improve navigation between the currently very separate views. As mentioned before the center is presented in a prominent location in every view, in addition here all views (top-down, bottom-up, flat, source) should have an action (button + option in r-click menu + hotkey) to change the center to the selected context. In the case of the source view, this should refer to the associated line context. Another button should be present to reset the center back to the invisible CCT root.

This should remove or replace the "zoom-in/out" button, which conflicts in semantics (the topmost row has no special meaning there) and also is inferior to "smooth" locomotion options like scroll bars (since "smooth" locomotion more clearly does not alter the presented information).

Note that having a line context (or other syntactic constructs, like loops) as the center should be possible. The top-down view should present the calls on that line, the bottom-up view should present the callers that resulted in time taken in that line. This allows for finer inspection of second-order effects, where the caller's arguments resulted in poor performance in a particular part of a function.

Multiple Center Contexts
An extension to the above ideas is to support multiple "center" contexts, in this case the top-down and bottom-up views merge all center contexts to form their roots. This allows the performance of a whole C++ class to be inspected, instead of having to dig through each method in turn. This "special merge" should only occur on the topmost row, all other rows should use the usual merging logic. To reduce noise the topmost row should use a special <multiple contexts> indicator for its context, rather than listing all selected functions.

The interface for this should be identical to the normal "center" selection, just selecting multiple rows instead of one. Nattable already supports multi-row selection with the appropriate modifier keys.

In addition, the flat view should allow for selection of fake scope contexts (file and binary) as "centers," this should select all functions within as the center contexts. (It might also help to have additional kinds of fake scopes in the flat view, such as C++ class or namespace.)

blue42u · 2022-03-07T05:24:50Z

blue42u
Mar 7, 2022
Author

@martyitz wrote:

You mention a similarity to PA, so I’d like to expand a bit on that. When I first
joined the group three years ago, I did a comparison, [attachment elided].

There is a major difference in philosophy between PA and HPCT. HPCT presents the data in the context
of execution behavior; PA presents the data in the context of the user’s sources. Its focus is much more like
the flat view than either top-down or bottom-up.

To me, that’s the single most important difference, and the most important issue to discuss for any redesign:
what is the user’s context, and how is it handled everywhere?

In the case of malloc, for example, the PA data presented for the function malloc would be aggregated across
all threads and calling contexts. The main display would also show the first-order callers and callees
in the lower part of the display.

The issue of not knowing what within malloc is problematic is addressed in PA by showing the source (or disassembly)
of the function, with metrics on each line (or instruction). For PA, what you refer to as a “center” context,
is a selected function. The selection propagates through all parts of the GUI.

PA had fairly powerful filtering metrics, that allowed the user to include or exclude call stacks
with specific functions in them or not. It also allowed time-filtering. Its also allowed navigation
by selecting a frame in a callstack on the Timeline (== TraceView), and then showing its callers and callees,
source, disassembly, etc.

It was able to do that easily because the data was recorded as events with their callstacks, and was not
reduced to a calling-context tree at run time. One downside of PA’s strategy is its relatively poor scalability
compared to HPCT. Since aggregation is not done at run-time, the data files can be much larger in PA than in HPCT.

HPCT also copes well with many, many threads and processes; PA copes less well. And there are features in
HPCT that are just absent in PA, among them GPU support, and processing of inlined-functions.

1 reply

blue42u Mar 7, 2022
Author

Thank you for the info on PA and comparison with Toolkit, this gives me a better picture of the larger context here.

To me, that’s the single most important difference, and the most important issue to discuss for any redesign: what is the user’s context, and how is it handled everywhere?

You make a good point. To answer to this question, I want to shift perspective on the problem a bit.

The primary goal of both the Viewer (and I presume PA) is to provide insight from inspecting the performance data (a bit more than merely presenting the data). The simplest question to ask is "why is X running slow?," IMO the difference in "context" you observe between the tools defines both what is "X" and what it means to "run slow" in the answered question.

AFAICT and IMHO, Toolkit defines "X" as the application as a whole and "running slow" as having globally high performance cost in some calling context(s). PA defines "X" as individual functions and "running slow" as inadequate implementation. While both are important, I do not consider either sufficient for application development.

The "context" I am aiming for is the intended implementation ("code as developed"), combining behavior and sources -- "X" can be any code construct (function, line, loop, etc.) and "running slow" means behaving with an inordinately high performance cost than intended or expected. I believe this combined "context" is necessary, a couple very simple examples:

Multiplying numbers using exp(log(a) + log(b)) is far slower than using standard multiplication. Without behavior, the finger is pointed at libm for running slow, even though it is running quite well given the use case.
Improper loop bounds (especially in C++) can result in unvectorized loops. Without sources, it is unclear why the function is not performing at appropriate speeds sometimes, especially when the performance changes along with the arguments.

As it stands, the Viewer already sports features for both kinds of "context:" top-down and bottom-up for behavior, flat and source for sources. My initial proposal (intends to) combine and strengthen the features already present.

The issue of not knowing what within malloc is problematic is addressed in PA by showing the source (or disassembly)
of the function, with metrics on each line (or instruction).

I do not consider this a complete view of what is problematic within malloc's implementation, it could very well be that The Real Problem is that the page table has grown huge and/or is expensive to update. This is only made visible by inspecting the "behavior" of malloc (ie. the performance of its callees in context). Thus my intention above.

PA had fairly powerful filtering metrics, that allowed the user to include or exclude call stacks
with specific functions in them or not.

Is this filtering in the timeline or in some other view? Based on my assumptions so far, it sounds similar to the way the "center(s)" only show calling context trees containing them.

It also allowed time-filtering. Its also allowed navigation
by selecting a frame in a callstack on the Timeline (== TraceView), and then showing its callers and callees,
source, disassembly, etc.

It was able to do that easily because the data was recorded as events with their callstacks, and was not
reduced to a calling-context tree at run time.

I suggested some of what the trace view is missing from this description a bit ago: #96. Time-filtering is not possible for the technical reasons mentioned.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Context-Centric Profile Views #183

{{title}}

Replies: 1 comment 1 reply

{{title}}

{{title}}

Select a reply

Context-Centric Profile Views #183

blue42u Mar 5, 2022

Replies: 1 comment · 1 reply

blue42u Mar 7, 2022 Author

blue42u Mar 7, 2022 Author

blue42u
Mar 5, 2022

Replies: 1 comment 1 reply

blue42u
Mar 7, 2022
Author

blue42u Mar 7, 2022
Author