toplev: Info_Bottlenecks reports negative Scaled_Slots on SKX #488

Open
andikleen opened this issue Oct 5, 2023 · 4 comments

Comments

@andikleen
Owner

e.g. on SKL

./toplev --metrics -l3 -q ./workloads/GITGREP 2>&1 | grep Bottleneck
C0-T0 Info.Bottleneck Mispredictions Scaled_Slots -1.85 [ 1.0%]
C0-T0 Info.Bottleneck Irregular_Overhead Scaled_Slots -7.60 [ 1.0%]
...

Interestingly, it goes away with --single-thread, so it might be an SMT issue?

@aayasin
Collaborator

aayasin commented Oct 6, 2023

There are at least three problems with this test workload & recent toplev:

  1. The Bottlenecks View requires at least a level-4 tree.
  2. The run time of ~1 second is too short, which runs into multiplexing issues.
  3. Trunk toplev stopped listing the nodes with zero counts, which perf-tools relies on. Please revert that.

Here is a reproducer. The first line is the command to run inside the perf-tools folder, followed by its output on ICX.

The first run, with trunk pmu-tools and --no-multiplex, shows no negative bottlenecks. The actual toplev command is kept for reference.

./do.py --tune :forgive:0 :help:0 :msr:1 :sample:3 :size:1 :loops:3 :loop-ideal-ipc:1 -v0 profile -a './workloads/GITGREP pmu-tools1 no-mux' -pm 10 -v1 --pmu-tools ../pmu-tools --toplev-args ' --no-multiplex'                                                                                                                                                    
INFO: App: ./workloads/GITGREP pmu-tools1 no-mux .                                                                                                                                
topdown full tree + All Bottlenecks ..                                                                                                                                            
../pmu-tools/toplev.py --no-desc -vl6 --nodes '+IPC,+Instructions,+UopPI,+Time,+SLOTS,+CLKS,+Mispredictions,+Big_Code,+Instruction_Fetch_BW,+Branching_Overhead,+DSB_Misses,+Cache_Memory_Bandwidth,+Cache_Memory_Latency,+Memory_Data_TLBs,+Memory_Synchronization,+Irregular_Overhead,+Other_Bottlenecks,+Base_Non_Br' -V GITGREP-pmu-tools1-no-mux.toplev-vl6-perf.csv --no-multiplex --tune 'DEDUP_NODE = "MEM_Parallel_Reads,Lock_Latency,Slots_Utilization,Power,L2_Bound,Big_Code,DSB_Misses,IC_Misses,Contested_Accesses,Data_Sharing,PMM_Bound,Memory_Operations,DRAM_Bound,Other_Light_Ops,Mispredictions,Cache_Memory_Bandwidth,Cache_Memory_Latency,Memory_Data_TLBs,Memory_Synchronization,Base_Non_Br,Instruction_Fetch_BW,Irregular_Overhead,Core_Bound_Likely,Branch_Misprediction_Cost,Other_Bottlenecks"' -- ./workloads/GITGREP pmu-tools1 no-mux 2>&1 | tee GITGREP-pmu-tools1-no-mux.toplev-vl6.log | egrep '<==|MUX|Info(\.Bot|.*Time)|warning.*zero' | sort                                                                                                                          
BE/Core          Backend_Bound.Core_Bound.Ports_Utilization.Ports_Utilized_2                                   % Clocks                           18.2   <==                      
Info.Botlnk.L2   DSB_Misses                                                                                      Scaled_Slots                      2.38                           
Info.Bottleneck  Base_Non_Br                                                                                     Scaled_Slots                     32.35                           
Info.Bottleneck  Big_Code                                                                                        Scaled_Slots                      1.67                           
Info.Bottleneck  Branching_Overhead                                                                              Scaled_Slots                      9.56                           
Info.Bottleneck  Cache_Memory_Bandwidth                                                                          Scaled_Slots                      1.26                           
Info.Bottleneck  Cache_Memory_Latency                                                                            Scaled_Slots                      1.55                           
Info.Bottleneck  Instruction_Fetch_BW                                                                            Scaled_Slots                      9.60                           
Info.Bottleneck  Irregular_Overhead                                                                              Scaled_Slots                      4.69                           
Info.Bottleneck  Memory_Data_TLBs                                                                                Scaled_Slots                      1.42                           
Info.Bottleneck  Memory_Synchronization                                                                          Scaled_Slots                      0.01                           
Info.Bottleneck  Mispredictions                                                                                  Scaled_Slots                     19.24                           
Info.Bottleneck  Other_Bottlenecks                                                                               Scaled_Slots                     18.64                           
Info.System      Time                                                                                            Seconds                           1.77                           
MUX                                                                                                            %                                 100.00                           

This is the failure you get by default using pmu-tools at the 4.6 release point.

./do.py --tune :forgive:0 :help:0 :msr:1 :sample:3 :size:1 :loops:3 :loop-ideal-ipc:1 -v0 profile -a './workloads/GITGREP pmu-tools1 do-mux' -pm 10 -v1
INFO: App: ./workloads/GITGREP pmu-tools1 do-mux .
topdown full tree + All Bottlenecks ..
/usr/bin/python /home/admin1/ayasin/perf-tools/pmu-tools/toplev.py --no-desc -vl6 --nodes '+IPC,+Instructions,+UopPI,+Time,+SLOTS,+CLKS,+Mispredictions,+Big_Code,+Instruction_Fetch_BW,+Branching_Overhead,+DSB_Misses,+Cache_Memory_Bandwidth,+Cache_Memory_Latency,+Memory_Data_TLBs,+Memory_Synchronization,+Irregular_Overhead,+Other_Bottlenecks,+Base_Non_Br' -V GITGREP-pmu-tools1-do-mux.toplev-vl6-perf.csv --frequency --metric-group +Summary --tune 'DEDUP_NODE = "MEM_Parallel_Reads,Lock_Latency,Slots_Utilization,Power,L2_Bound,Big_Code,DSB_Misses,IC_Misses,Contested_Accesses,Data_Sharing,PMM_Bound,Memory_Operations,DRAM_Bound,Other_Light_Ops,Mispredictions,Cache_Memory_Bandwidth,Cache_Memory_Latency,Memory_Data_TLBs,Memory_Synchronization,Base_Non_Br,Instruction_Fetch_BW,Irregular_Overhead,Core_Bound_Likely,Branch_Misprediction_Cost,Other_Bottlenecks"' -- ./workloads/GITGREP pmu-tools1 do-mux 2>&1 | tee GITGREP-pmu-tools1-do-mux.toplev-vl6.log | egrep '<==|MUX|Info(\.Bot|.*Time)|warning.*zero' | sort
BE/Core        Backend_Bound.Core_Bound                                                                      % Slots                           21.2    [30.0%]<==
Info.Botlnk.L2 DSB_Misses                                                                                      Scaled_Slots                     0.58   [ 6.1%]
Info.Bottleneck Base_Non_Br                                                                                    Scaled_Slots                   -75.96   [ 7.5%]
Info.Bottleneck Big_Code                                                                                       Scaled_Slots                     5.49   [85.8%]
Info.Bottleneck Branching_Overhead                                                                             Scaled_Slots                   114.85   [ 7.5%]
Info.Bottleneck Cache_Memory_Bandwidth                                                                         Scaled_Slots                     2.24   [ 7.5%]
Info.Bottleneck Cache_Memory_Latency                                                                           Scaled_Slots                     1.25   [12.0%]
Info.Bottleneck Instruction_Fetch_BW                                                                           Scaled_Slots                     9.59   [23.1%]
Info.Bottleneck Irregular_Overhead                                                                             Scaled_Slots                     8.49   [ 7.0%]
Info.Bottleneck Memory_Data_TLBs                                                                               Scaled_Slots                     0.42   [ 7.0%]
Info.Bottleneck Memory_Synchronization                                                                         Scaled_Slots                     0.02   [ 7.0%]
Info.Bottleneck Mispredictions                                                                                 Scaled_Slots                    14.51   [85.8%]
Info.Bottleneck Other_Bottlenecks                                                                              Scaled_Slots                    19.11   [ 7.0%]
Info.System    Time                                                                                            Seconds                          1.77
MUX                                                                                                          %                                  0.00
warning: 35 nodes had zero counts: ALU_Op_Utilization Clears_Resteers DSB DTLB_Load DTLB_Store Decoder0_Alone L1_Bound L3_Hit_Latency Load_Op_Utilization Local_DRAM MITE MITE_4wide Microcode_Sequencer Mispredicts_Resteers Mixing_Vectors Other_Mispredicts Other_Nukes Port_0 Port_1 Port_5 Port_6 Ports_Utilization Ports_Utilized_0 Ports_Utilized_1 Remote_Cache Remote_DRAM Serializing_Operation Slow_Pause Split_Loads Split_Stores Store_Latency Store_Op_Utilization Store_STLB_Miss Unknown_Branches X87_Use
ERROR: Too many metrics with zero counts; 35 unexpected (ALU_Op_Utilization Clears_Resteers DSB DTLB_Load DTLB_Store Decoder0_Alone L1_Bound L3_Hit_Latency Load_Op_Utilization Local_DRAM MITE MITE_4wide Microcode_Sequencer Mispredicts_Resteers Mixing_Vectors Other_Mispredicts Other_Nukes Port_0 Port_1 Port_5 Port_6 Ports_Utilization Ports_Utilized_0 Ports_Utilized_1 Remote_Cache Remote_DRAM Serializing_Operation Slow_Pause Split_Loads Split_Stores Store_Latency Store_Op_Utilization Store_STLB_Miss Unknown_Branches X87_Use). Run longer or use: --toplev-args ' --no-multiplex' !
 !
ERROR: Command "./do.py --tune :forgive:0 :help:0 :msr:1 :sample:3 :size:1 :loops:3 :loop-ideal-ipc:1 -v0 profile -a './workloads/GITGREP pmu-tools1 do-mux' -pm 10 -v1" failed with '256' !
 !

perf-tools flags the zero counts & suggests running longer or using --no-multiplex.
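
For anyone who wants to spot these cases in a saved log, here is a minimal sketch that scans toplev-style output lines for negative Scaled_Slots values or a low multiplexing percentage. It is not part of toplev or perf-tools: the function name `suspicious_bottlenecks`, the regular expression, and the 10% threshold are assumptions made for illustration, and the line layout is inferred only from the logs quoted in this thread.

```python
import re

# Matches lines such as:
#   Info.Bottleneck Base_Non_Br   Scaled_Slots   -75.96   [ 7.5%]
# An optional prefix (e.g. "C0-T0") may precede it, and the trailing [..%]
# multiplexing column is optional (it is absent in the --no-multiplex log).
LINE_RE = re.compile(
    r"(?P<area>Info\.Bot\S*)\s+(?P<node>\S+)\s+Scaled_Slots\s+"
    r"(?P<value>-?\d+(?:\.\d+)?)(?:\s+\[\s*(?P<mux>\d+(?:\.\d+)?)%\])?"
)

def suspicious_bottlenecks(lines, min_mux_pct=10.0):
    """Return (node, value, mux_pct) for negative or thinly sampled metrics.

    min_mux_pct is a hypothetical threshold, not a value taken from toplev.
    """
    flagged = []
    for line in lines:
        m = LINE_RE.search(line)
        if not m:
            continue
        value = float(m.group("value"))
        mux = float(m.group("mux")) if m.group("mux") else None
        if value < 0 or (mux is not None and mux < min_mux_pct):
            flagged.append((m.group("node"), value, mux))
    return flagged

# Sample lines copied from the logs above (whitespace shortened).
sample = [
    "Info.Bottleneck Base_Non_Br          Scaled_Slots   -75.96   [ 7.5%]",
    "Info.Bottleneck Big_Code             Scaled_Slots     5.49   [85.8%]",
    "Info.Bottleneck Mispredictions       Scaled_Slots    19.24",
]
for node, value, mux in suspicious_bottlenecks(sample):
    print(node, value, mux)
```

Only Base_Non_Br is reported for the sample, since it is both negative and sampled well below the assumed threshold.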

@andikleen
Owner Author

But even with multiplex issues, shouldn't the formula guard against bad values? These are not uncommon.

I have an open bug on detecting too-short run times for multiplexing in toplev.
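
To make the kind of guard concrete, here is a rough sketch. It is not toplev's actual formula code: `guard_scaled_slots` is a hypothetical helper, and the expected range of 0 to 100 for a Scaled_Slots bottleneck is an assumption. It clamps the value and also reports whether the raw value was in range, so a bad sample can be flagged instead of silently hidden.

```python
def guard_scaled_slots(value, lo=0.0, hi=100.0):
    """Clamp a metric to [lo, hi] and report whether the raw value was in range.

    The [0, 100] default range is an assumption for illustration.
    """
    in_range = lo <= value <= hi
    clamped = min(max(value, lo), hi)
    return clamped, in_range

# Raw values taken from the multiplexed run quoted earlier in this thread.
for name, raw in [("Base_Non_Br", -75.96),
                  ("Branching_Overhead", 114.85),
                  ("Big_Code", 5.49)]:
    clamped, ok = guard_scaled_slots(raw)
    status = "ok" if ok else "out of range"
    print(f"{name}: raw={raw} clamped={clamped} ({status})")
```

Clamping alone would mask the underlying multiplexing problem, which is why the sketch returns the in-range flag alongside the clamped value.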

@andikleen
Owner Author

Also I'm surprised that 1s is not enough anymore to get through all the groups. It must have really grown a lot.

@aayasin
Collaborator

aayasin commented Oct 9, 2023

1s is too short.

There are around a couple dozen groups for the full tree with current toplev, so each group gets sampled <5% of the time.
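
As a back-of-the-envelope check, the arithmetic can be sketched as below. The group count of 24 is an assumption standing in for "a couple dozen", the 1.77 s run time is taken from the logs above, and even round-robin scheduling across groups is a simplification of what the kernel actually does.

```python
def per_group_share(num_groups, run_seconds):
    """Return (fraction of the run, seconds) each group is scheduled for,
    assuming all groups are rotated evenly."""
    share = 1.0 / num_groups
    return share, share * run_seconds

# 24 groups is an assumed figure; 1.77 s is the Time reported in the logs.
share, seconds = per_group_share(num_groups=24, run_seconds=1.77)
print(f"each group: {share:.1%} of the run, about {seconds * 1000:.0f} ms")
```

Under these assumptions each group is live for roughly 4% of the run, a few tens of milliseconds, so a short phase of the workload can easily be missed by a given group.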
