A standard set of performance events and metrics is specified below. The events and metric definitions are organized into groups, many of which include a hierarchy of sub-events that constrain what is counted.
Some of the event groups include both speculative and non-speculative events. In such groups, the event names are appended with one of the following.
- .RET - for non-speculative events counted at retirement.
- .SPEC - for speculative events, which may include events incurred by instructions that do not retire.
Note: In general, RET events are more useful for performance analysis, since they are consistent with software’s view of the instruction flow. But they can be significantly more expensive to implement, as they require event data to be staged along with the associated instruction to retirement. It is up to implementations to decide whether to support the RET variant, the SPEC variant, or both.
This group contains events that count RISC-V (and custom) instructions. Each event in this group has both speculative (.SPEC) and non-speculative (.RET) versions, though only the non-speculative events are listed in the tables below.
INST events are broken down by instruction categories at the top level of the event hierarchy. When no category is included in the event name (e.g., INST.RET), all instructions are counted.
BRJMP | Branch and jump instructions. Includes all BRANCH, JAL, and JALR opcodes, including compressed varieties. |
MISPRED | Branch and jump instructions that were mispredicted. |
LOAD | Memory load instructions. Includes all instructions that perform memory read operations. |
STORE | Memory store instructions. Includes all instructions that perform memory write operations. |
LDST | Memory load and store instructions. Represents the union of the LOAD and STORE categories. |
MO | Memory ordering instructions. Includes FENCE and FENCE.TSO instructions. |
INT | Integer computational instructions. Includes all integer computational instructions from RVxI, including compressed varieties. Also includes all computational instructions from the M extension and the A extension (AMO*). Whether NOP instructions are counted is implementation-dependent. |
FP | Floating point instructions. Includes all instructions from the F, D, Q, and Zfa extensions. |
RVV | Vector instructions. Includes all instructions from the V extension. |
RVC | Compressed instructions. Includes all instructions from the C extension. |
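As a rough sketch of how these category events might be consumed, the following computes an instruction-mix breakdown from a set of already-collected counts. The counter values and the mechanism that collected them are hypothetical; only the event names come from the table above.

```c
#include <stdint.h>
#include <stdio.h>

int main(void) {
    /* Hypothetical counts of the non-speculative (.RET) category events,
     * collected by some external means (e.g., a perf tool). */
    uint64_t inst_ret       = 1000000;  /* INST.RET       */
    uint64_t inst_brjmp_ret =  180000;  /* INST.BRJMP.RET */
    uint64_t inst_ldst_ret  =  350000;  /* INST.LDST.RET  */
    uint64_t inst_fp_ret    =   50000;  /* INST.FP.RET    */

    printf("branch/jump fraction: %5.1f%%\n", 100.0 * inst_brjmp_ret / inst_ret);
    printf("load/store fraction:  %5.1f%%\n", 100.0 * inst_ldst_ret / inst_ret);
    printf("FP fraction:          %5.1f%%\n", 100.0 * inst_fp_ret / inst_ret);
    return 0;
}
```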
The control transfer instruction categories, BRJMP and MISPRED, include additional levels of hierarchy that allow counting per transfer instruction type. When no type is included in the event name (e.g., INST.BRJMP.RET), all types are counted.
BRANCH | Branch instructions. |
BRANCH.TK | Taken branch instructions. |
BRANCH.NT | Not-taken branch instructions. |
IND | Indirect (uninferrable) jump and call instructions. |
IND.CALL | Indirect (uninferrable) call instructions. |
IND.JUMP | Indirect (uninferrable) jump instructions without linkage. |
IND.LJUMP | Other indirect (uninferrable) jump instructions with linkage. |
DIR | Direct (inferrable) jump and call instructions. Applies only for BRJMP, not MISPRED. |
DIR.CALL | Direct (inferrable) call instructions. Applies only for BRJMP, not MISPRED. |
DIR.JUMP | Direct (inferrable) jump instructions without linkage. Applies only for BRJMP, not MISPRED. |
DIR.LJUMP | Other direct (inferrable) jump instructions with linkage. Applies only for BRJMP, not MISPRED. |
CORSWAP | Co-routine swap instructions. |
RETURN | Function return instructions. |
TRAPRET | Trap return instructions. |
All types listed above utilize the control transfer type definitions provided by the trace and Smctr/Ssctr specs. For completeness, the definitions are replicated below.
Transfer Type Name | Associated Opcodes |
---|---|
Indirect call | JALR x1, rs where rs != x5; JALR x5, rs where rs != x1; C.JALR rs1 where rs1 != x5 |
Direct call | JAL x1; JAL x5; C.JAL; CM.JALT index |
Indirect jump (without linkage) | JALR x0, rs where rs != (x1 or x5); C.JR rs1 where rs1 != (x1 or x5) |
Direct jump (without linkage) | JAL x0; C.J; CM.JT index |
Co-routine swap | JALR x1, x5; JALR x5, x1; C.JALR x5 |
Function return | JALR rd, rs where rs == (x1 or x5) and rd != (x1 or x5); C.JR rs1 where rs1 == (x1 or x5); CM.POPRET(Z) |
Other indirect jump (with linkage) | JALR rd, rs where rs != (x1 or x5) and rd != (x0, x1, or x5) |
Other direct jump (with linkage) | JAL rd where rd != (x0, x1, or x5) |
Trap returns | MRET; SRET |
Note: Trap returns are only counted if the originating privilege mode is enabled for counting. Thus a counter configured to only count in U-mode will never increment for a trap return.
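As a small illustration of the per-type hierarchy, the sketch below derives a taken-branch rate from the BRANCH sub-events. The counts are invented; the not-taken count could equally be read directly from INST.BRJMP.BRANCH.NT.RET.

```c
#include <stdint.h>
#include <stdio.h>

int main(void) {
    /* Hypothetical non-speculative branch type counts. */
    uint64_t branch_ret    = 120000;  /* INST.BRJMP.BRANCH.RET    */
    uint64_t branch_tk_ret =  80000;  /* INST.BRJMP.BRANCH.TK.RET */

    /* Derived not-taken count; equals INST.BRJMP.BRANCH.NT.RET. */
    uint64_t branch_nt_ret = branch_ret - branch_tk_ret;

    printf("taken-branch rate: %.1f%%\n", 100.0 * branch_tk_ret / branch_ret);
    printf("not-taken branches: %llu\n", (unsigned long long)branch_nt_ret);
    return 0;
}
```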
In addition, there are events defined for combinations of the types above.
TK | JAL, JALR, MRET, SRET, and taken branch instructions. |
PRED | Predicted branch and jump instructions. Represents the union of BRANCH, IND, CORSWAP, and RETURN types above. This is implicit for MISPRED, so applies only for BRJMP. |
JUMP | Indirect and direct jump instructions. Includes jump instructions with linkage, but not calls or returns. |
CALL | Indirect and direct call instructions. |
RETALL | All function return instructions, the union of RETURN and CORSWAP. |
CALLALL | All function call instructions, the union of CALL and CORSWAP. |
UNCOND | JAL, JALR, MRET, and SRET instructions. |
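Since PRED is implicit for MISPRED, a plausible misprediction-rate metric divides mispredicted transfers by predicted transfers; a per-kilo-instruction normalization is also common. A minimal sketch with made-up counts:

```c
#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint64_t inst_ret       = 1000000;  /* INST.RET            */
    uint64_t brjmp_pred_ret =  150000;  /* INST.BRJMP.PRED.RET */
    uint64_t mispred_ret    =    6000;  /* INST.MISPRED.RET    */

    printf("misprediction rate: %.2f%%\n", 100.0 * mispred_ret / brjmp_pred_ret);
    printf("mispredictions per kilo-instruction: %.2f\n", 1000.0 * mispred_ret / inst_ret);
    return 0;
}
```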
The memory access instruction categories, LOAD, STORE, and LDST, include additional levels of hierarchy that allow counting per data source, address source, or cacheability. When no additional qualifier is included in the event name (e.g., INST.LOAD.RET), all types are counted.
UC | Instructions that perform a data access to an uncacheable region of memory. |
DSRC.* | Instructions that accessed data from the specified source (see Table 6). |
ASRC.* | Instructions whose data address translation came from the specified source (see Table 7). |
STLF | Load instructions to which data was forwarded by an older store. |
<cache> | Instructions for which the data access hit in the selected cache. See cache naming standards in CACHE Events. |
<cache>.MISS | Instructions for which the data access missed in the selected cache. See cache naming standards in CACHE Events. |
<cache>.MERGE | Instructions for which the data access merged with an outstanding miss in the selected cache. See cache naming standards in CACHE Events. |
LOCAL.MEM | Instructions for which the data access hit in local memory. |
REMOTE.MEM | Instructions for which the data access hit in remote memory. |
REMOTE.CACHE | Instructions for which the data access hit in a remote cache. |
REMOTE.HITM | Instructions for which the data access hit modified data in a remote cache. |
Note: Some instructions perform multiple memory accesses. Because these events count instructions, the event should be incremented if any access performed by the associated instruction meets the event criteria.
Note: What constitutes local vs remote above is left up to implementations. It is expected that remote accesses incur significantly more latency than local accesses. A straightforward approach may be for local to imply the same NUMA node, while remote implies a different NUMA node.
Note: Implementations may execute some memory accesses post-retirement. In such cases, even non-speculative (.RET) DSRC events may not be reflected in the counter immediately after the associated instruction retires.
<tlb> | Instructions for which the data address translation hit in the selected TLB. See TLB naming standards in TLB Events. |
<tlb>.MISS | Instructions for which the data address translation missed in the selected TLB. See TLB naming standards in TLB Events. |
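As a sketch of how these qualifiers might be used, assuming an implementation that provides an L1D data cache in its <cache> hierarchy, a tool could estimate the L1D load miss rate and the store-to-load forwarding rate as follows (counts are illustrative):

```c
#include <stdint.h>
#include <stdio.h>

int main(void) {
    /* Hypothetical non-speculative load counts; the available <cache> and <tlb>
     * qualifiers are implementation-defined. */
    uint64_t load_ret          = 300000;  /* INST.LOAD.RET          */
    uint64_t load_l1d_miss_ret =   9000;  /* INST.LOAD.L1D.MISS.RET */
    uint64_t load_stlf_ret     =  15000;  /* INST.LOAD.STLF.RET     */

    printf("loads that missed L1D:         %.2f%%\n", 100.0 * load_l1d_miss_ret / load_ret);
    printf("loads satisfied by forwarding: %.2f%%\n", 100.0 * load_stlf_ret / load_ret);
    return 0;
}
```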
The RVV events include additional levels of hierarchy that allow counting per vector instruction type. When no type is included in the event name (e.g., INST.RVV.RET), all types are counted.
LOAD | Vector load instructions. |
STORE | Vector store instructions. |
LDST | Vector memory instructions, the union of the LOAD and STORE types. |
CFG | Vector configuration (VSET{I}VL{I}) instructions. |
ARITH | Vector arithmetic instructions. |
ARITH.INT | Vector arithmetic vector-integer instructions. |
ARITH.FP | Vector arithmetic vector-FP instructions. |
The LOAD, STORE, and LDST types include an additional level of hierarchy that allows counting per addressing mode. When no addressing mode is included in the event name (e.g., INST.RVV.LOAD.RET), all modes are counted.
UNIT | Vector unit-stride access instructions. |
IDXU | Vector indexed-unordered access instructions. |
STRD | Vector strided access instructions. |
IDXO | Vector indexed-ordered access instructions. |
This group contains events for explicit load and store operations performed by instructions. It does not count prefetch accesses or implicit accesses, such as page table lookups.
Each event in this group has both speculative (.SPEC) and non-speculative (.RET) versions, though only the non-speculative events are listed in the tables below.
The breakdown of these events is identical to that described in Memory Access Instruction Events, with the caveat that these events count each memory access. The INST.{LOAD,STORE,LDST} events count only instructions, some of which may perform multiple accesses.
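This distinction can itself be measured. The sketch below (with invented counts, and assuming the events in this group carry a MEM prefix) estimates the average number of memory accesses performed per retired load/store instruction:

```c
#include <stdint.h>
#include <stdio.h>

int main(void) {
    /* Hypothetical counts: the access events count accesses,
     * while the INST events count instructions. */
    uint64_t mem_ldst_ret  = 420000;  /* MEM.LDST.RET (assumed name for this group) */
    uint64_t inst_ldst_ret = 350000;  /* INST.LDST.RET                              */

    printf("average accesses per load/store instruction: %.2f\n",
           (double)mem_ldst_ret / inst_ldst_ret);
    return 0;
}
```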
This group contains events and metrics for instruction, data, and unified caches. Events in this group count accesses from the cache’s perspective; if an instruction performs two cache accesses, they will be counted separately. Further, all events in this group count speculative accesses.
The top level of the CACHE event hierarchy identifies the cache associated with the event. The standard cache naming scheme is detailed below.
- The first character is L, for level.
- The second character indicates the cache level. The nearest cache is level 1, followed by level 2, and so on. Level L implies the last level cache.
- The optional third character indicates whether the cache is accessed only by instruction fetches (I) or only by data accesses (D).
A typical example implementation might include the following caches within the CACHE event hierarchy.
L1D | Level 1 data cache |
L1I | Level 1 instruction cache |
L2 | Level 2 cache |
L3 | Level 3 cache |
LL | Last level cache |
Note: The events for the last level cache may be aliased to the events for one of the cache levels listed above it.
Note: An implementation may additionally include other, custom caches within the CACHE event hierarchy, such that the cache name does not conform to the standard convention above.
Like all events in this spec, these CACHE events are per-hart. For caches that may be shared by multiple harts, the CACHE events should increment only for occurrences directly associated with the local hart.
The second level of the CACHE event hierarchy identifies the access or operation type, as shown in the table below. Some caches may support only a subset of the defined types.
RD | Data reads that look up the cache. This includes explicit reads (e.g., the LW instruction) and implicit reads (e.g., page table walks). |
RD.DATA | The subset of reads that are data load operations. |
RD.PREF | The subset of reads that are incoming prefetch reads. |
RD.CODE | The subset of reads that are code fetch reads. |
WR | Data writes to the cache. This includes explicit writes (e.g., the SW instruction) and implicit writes (e.g., page accessed/dirty attribute updates). |
RW | Data reads and writes, the union of the RD and WR types above. |
FILL | Cache misses or prefetches that result in a line in the cache being filled with data from memory or a higher-level cache. |
WB | Modified lines written out from the cache to memory or to a higher-level cache. |
SNOOP | Cache coherency snoops. |
OPREF | Outbound prefetches, for the purpose of pulling lines into the associated cache. |
For caches that may be shared by multiple harts, the lookup events (RD*, WR, RW) and SNOOP events should increment only for requests from the local hart. FILL and WB events should increment only for fills and writebacks resulting from requests from the local hart. OPREF events should not increment.
Some events in the CACHE event hierarchy include additional levels. For cache lookup access types (RD*, WR, or RW), a third level counts per lookup result.
ACCESS | Cache lookups |
HIT | Cache hits |
MISS | Cache misses |
MERGE | Lookups that merged with an outstanding lookup |
For the RD.DATA and RD.CODE types, there is additionally a MISS.CYCLES event that counts core cycles (at the rate of the Zicntr cycle counter) while a load or instruction fetch cache miss, respectively, is outstanding.
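For example, assuming the hierarchy composes as CACHE.<cache>.<type>.<result>, a tool might derive an L1D read miss ratio and a rough per-miss latency figure; the latter is only an approximation, since overlapping misses all contribute to MISS.CYCLES. Counts are illustrative.

```c
#include <stdint.h>
#include <stdio.h>

int main(void) {
    /* Hypothetical L1D read-side counts. */
    uint64_t rd_access           = 500000;  /* CACHE.L1D.RD.ACCESS           */
    uint64_t rd_miss             =  20000;  /* CACHE.L1D.RD.MISS             */
    uint64_t rd_data_miss        =  15000;  /* CACHE.L1D.RD.DATA.MISS        */
    uint64_t rd_data_miss_cycles = 900000;  /* CACHE.L1D.RD.DATA.MISS.CYCLES */

    printf("L1D read miss ratio: %.2f%%\n", 100.0 * rd_miss / rd_access);
    printf("approx. cycles per outstanding load miss: %.1f\n",
           (double)rd_data_miss_cycles / rd_data_miss);
    return 0;
}
```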
The SNOOP type events have two additional levels of hierarchy.
LOCAL | Snoops from accesses local to the hart |
REMOTE | Snoops from remote harts |
If neither the LOCAL nor the REMOTE modifier is included in the event name, then all snoops are counted.
ACCESS | Snoop lookups |
HIT | Snoops that hit unmodified data |
HITM | Snoops that hit modified data |
MISS | Snoops that miss |
A reference example of the full set of CACHE events and metrics, assuming the example set of caches shown above in Table 10, is illustrated in the tables below.
This group contains events and metrics for translation lookaside buffers (TLBs). Events in this group count accesses from the TLB’s perspective; if an instruction performs two address translations, they will be counted separately. Further, all events in this group count speculative accesses.
The top level of the TLB event hierarchy identifies the level of the TLB associated with the event. The naming scheme matches that for caches; see CACHE Events.
A typical example implementation might include the following TLBs within the TLB event hierarchy.
L1D | Level 1 data TLB (DTLB) |
L1I | Level 1 instruction TLB (ITLB) |
LL | Level 2 shared TLB (STLB) |
Note: An implementation may additionally include other, custom TLBs within the TLB event hierarchy, such that the TLB name does not conform to the standard convention above.
The second level of the TLB event hierarchy identifies the access or operation type, as shown in the table below. Some TLBs may support only a subset of the defined types.
LOAD | Load address translation requests. |
STORE | Store address translation requests. |
LDST | Load or store address translation requests. |
CODE | Instruction fetch address translation requests. |
A third level counts requests per lookup result.
ACCESS | TLB lookups |
HIT | TLB hits |
MISS | TLB misses |
MERGE | Lookups that merged with an outstanding lookup |
For the LDST and CODE types, there is additionally a MISS.CYCLES event that counts core cycles (at the rate of the Zicntr cycle counter) while a load/store or instruction fetch TLB miss, respectively, is outstanding.
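A sketch of two common TLB metrics, a miss ratio and misses per kilo-instruction, assuming the hierarchy composes as TLB.<tlb>.<type>.<result> and using invented counts:

```c
#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint64_t inst_ret        = 1000000;  /* INST.RET            */
    uint64_t l1d_ldst_access =  400000;  /* TLB.L1D.LDST.ACCESS */
    uint64_t l1d_ldst_miss   =    2000;  /* TLB.L1D.LDST.MISS   */

    printf("L1D TLB miss ratio: %.3f%%\n", 100.0 * l1d_ldst_miss / l1d_ldst_access);
    printf("L1D TLB misses per kilo-instruction: %.2f\n", 1000.0 * l1d_ldst_miss / inst_ret);
    return 0;
}
```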
A reference example of the full set of TLB events and metrics, assuming the example set of TLBs shown above in Table 15, is illustrated in the tables below.
This group contains events that count instruction fetch requests, broken down by the address translation source or the instruction data source. All events count speculative fetch requests.
DSRC.* | Instruction fetches that accessed data from the specified source (see Table 19). |
ASRC.* | Instruction fetches whose address translation came from the specified source (see Table 20). |
<cache> | Instruction fetches for which the instruction data access hit in the selected cache. See cache naming standards in CACHE Events. |
<cache>.MISS | Instruction fetches for which the instruction data access missed in the selected cache. See cache naming standards in CACHE Events. |
<cache>.MERGE | Instruction fetches for which the instruction data access merged with an outstanding miss in the selected cache. See cache naming standards in CACHE Events. |
LOCAL.MEM | Instruction fetches for which the instruction data access hit in local memory. |
REMOTE.MEM | Instruction fetches for which the instruction data access hit in remote memory. |
REMOTE.CACHE | Instruction fetches for which the instruction data access hit in a remote cache. |
REMOTE.HITM | Instruction fetches for which the instruction data access hit modified data in a remote cache. |
<tlb> | Instruction fetches for which the address translation hit in the selected TLB. See TLB naming standards in TLB Events. |
<tlb>.MISS | Instruction fetches for which the address translation missed in the selected TLB. See TLB naming standards in TLB Events. |
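One way such fetch events might be consumed is as a per-kilo-instruction rate; note that fetch events are speculative while INST.RET is not, so the metric mixes speculative and retired activity. The FETCH prefix and the counts below are assumptions for illustration.

```c
#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint64_t inst_ret       = 1000000;  /* INST.RET                              */
    uint64_t fetch_l1i_miss =    4000;  /* FETCH.L1I.MISS (assumed group prefix) */

    printf("L1I fetch misses per kilo-instruction: %.2f\n",
           1000.0 * fetch_l1i_miss / inst_ret);
    return 0;
}
```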
This group contains general events that don’t fall into other groups.
This group contains events and metrics for use with Top-down Microarchitecture Analysis (TMA) methodology.
TMA is an industry-standard methodology, introduced by Intel and first applied to characterizing the performance of SPEC CPU2006 on Intel CPUs, and since used to characterize HPC workloads, GPU workloads, microarchitecture changes, pre-silicon performance validation failures, and more.
TMA allows even developers with minimal microarchitecture knowledge to understand, for a given workload, where bottlenecks reside. It does so by accounting for the utilization of each pipeline "slot" in the microarchitecture. As an example, for a 4-wide implementation, there are 4 slots to account for each cycle. When the hardware is utilized with optimal efficiency, each slot is occupied by an instruction or micro-operation (uop) that will go on to execute and retire. When bottlenecks occur, due perhaps to a cache miss, branch misprediction, or any number of other microarchitectural conditions, some slots may be either unused or discarded, which results in inefficiency and reduced performance. TMA is able to identify these wasted slots, and the stalls, clears, misses, or other events that cause them. This enables developers to make informed decisions when tuning their code.
TMA accomplishes this by defining a set of hierarchical states into which each uop slot is categorized. Each cycle, the frontend of the processor (responsible for instruction fetch and decode) can issue some implementation-defined number (N) of instructions/uops to the backend (instruction execution and retire). Hence there are N issue slots to be categorized per cycle. At the top level of the TMA hierarchy, issue slots are categorized as described below.
Frontend Bound | Slots where the frontend did not issue a uop to the backend, despite the backend being able to accept uops. Example causes include stalls that result from cache or TLB misses during instruction fetch. |
Backend Bound | Slots where the backend could not consume a uop from the frontend. Example causes include backpressure that results from cache or TLB misses on data (load/store) accesses, or from oversubscribed execution units. |
Bad Speculation | Uops that are dropped, as a result of a pipeline flush. Example flushes include branch/jump mispredictions, memory ordering clears, exceptions, and interrupts. This state also includes slots that are unfilled by the frontend as the pipeline recovers from the flush, slots that otherwise would have been classified as Frontend Bound. |
Retiring | Uops retired. Ideally the majority of slots fall into this state. |
The full standard hierarchy of TMA states is illustrated below.
Note: Some imprecision within the event hierarchy is allowed and even expected. The standard L2 and L3 events may not sum precisely to the parent L1 or L2 events, respectively, as it is expected that there will be some additional sources of bottlenecks beyond those represented by the standard L2 and deeper (L2+) events. Because of this possible imprecision, it is recommended that lower level TMA events are examined only when the parent event count or rate is higher than expected. This avoids spending time on misleading L2+ events that may be implemented by imprecise event formulas rather than precise hardware events. Implementations may opt to add custom L2+ events, to identify additional bottlenecks specific to the microarchitecture.
The Frontend Bound state is broken down into Frontend Latency and Frontend Bandwidth L2 states. Slots that fall into Frontend Latency are cases where the decoders are awaiting fetch data, and hence have no uops to deliver to the backend. Frontend Latency is broken down into three standard L3 states: Address, Data, and Redirect. Frontend Address Latency represents slots unfilled due to awaiting the address translation for a fetch line; e.g., due to an ITLB miss. Frontend Data Latency represents slots unfilled due to awaiting data return for a fetch line; e.g., due to an L1I cache miss. Frontend Redirect Latency counts frontend pipeline slots discarded due to fetch redirects that can result from control transfers; e.g., due to jumps or taken branches.
Frontend Bandwidth represents slots where the frontend does not deliver a uop to the backend due to decoder restrictions. This occurs in cycles where the decoders have opcode bytes, and typically are able to deliver some uops to the backend, but are unable to fill all available slots. As an example, this can occur in cases where there is decoder asymmetry, which can result in reduced decoder bandwidth due to some instructions needing to wait until a capable decoder is available.
The Backend Bound state is broken down into the Core Bound and Memory Bound L2 states. Memory Bound represents slots where the backend is stalled (cannot accept uops from the frontend) due to stalls resulting from memory operations (loads and stores). Memory Bound is broken down into the Address and Data standard L3 states. Memory Address Bound represents slots where the backend is stalled due to waiting for a data address translation; e.g., due to a DTLB miss. Memory Data Bound represents slots where the backend is stalled due to waiting for a load or store access to complete; e.g., due to an L1D cache miss. These L3 events may not capture all causes for Memory Bound slots, which could include load/store queue full, pipeline conflicts, or other microarchitectural issues.
Memory Address Bound is broken down into L4 states per implemented TLB (see Table 15), plus the Page Walk state. So the Memory Address LL TLB State represents slots where the backend is stalled due to awaiting address translations from the last-level TLB, while the Page Walk state represents slots where the backend is stalled due to awaiting address translations that miss all TLBs and require a page walk.
Similarly, Memory Data Bound is broken down into L4 states per implemented cache (see Table 10), plus the External Memory state. The External Memory State represents slots where the backend is stalled due to awaiting data for requests that missed the last-level cache.
Counting Memory Bound slots per address or data source can be challenging, since, for example, the source of load data is typically only known once the stall ends and data is returned. To simplify this counting, the events defined below count slots stalled due to a miss at each TLB or cache level. This allows analysis tools to compute, for example, the slots attributed to accesses that hit cache Lx by counting the slots lost due to misses in cache Lx-1 minus those lost due to misses in cache Lx.
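A minimal sketch of that subtraction, assuming an implementation with L1D, L2, and last-level caches and hypothetical stall-slot counts for the corresponding Memory Data Bound states:

```c
#include <stdint.h>
#include <stdio.h>

int main(void) {
    /* Hypothetical slot counts: slots stalled due to a miss at each cache level. */
    uint64_t stall_l1d_miss = 600000;  /* slots stalled on L1D misses              */
    uint64_t stall_l2_miss  = 350000;  /* slots stalled on L2 misses               */
    uint64_t stall_ll_miss  = 200000;  /* slots stalled on last-level cache misses */

    /* Slots attributed to accesses that hit cache Lx are the slots lost to misses
     * in cache Lx-1 minus those lost to misses in cache Lx. */
    uint64_t hit_l2  = stall_l1d_miss - stall_l2_miss;  /* stalled, but satisfied by L2  */
    uint64_t hit_ll  = stall_l2_miss  - stall_ll_miss;  /* satisfied by last-level cache */
    uint64_t ext_mem = stall_ll_miss;                   /* satisfied by external memory  */

    printf("stall slots satisfied by L2:              %llu\n", (unsigned long long)hit_l2);
    printf("stall slots satisfied by last-level cache: %llu\n", (unsigned long long)hit_ll);
    printf("stall slots satisfied by external memory:  %llu\n", (unsigned long long)ext_mem);
    return 0;
}
```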
Core Bound represents slots where the backend is stalled for reasons other than memory-induced stalls. Core Bound has only a single standard L3 event, Serialization, which represents slots blocked due to serialization stalls that result from operations such as fences or CSR accesses. Other causes of Core Bound slots include backpressure resulting from long latency arithmetic operations, or execution unit oversubscription.
Warning: The line between Memory Bound and Core Bound can be blurry. If the backend is stalled because the register file is full, it could be due to register resources being held by outstanding memory accesses (Memory Bound), or by a string of long-latency arithmetic operations (Core Bound). Ultimately it is up to implementers to determine how to handle any unclear cases that may exist in their implementations, which could include making a "best guess" between Memory Bound and Core Bound, or creating additional Backend Bound L2 events.
The events that follow count slots for each of the states listed above, while the metrics express each state's slot count as a percentage of total slots.
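For instance, the top-level metrics could be computed as sketched below. The slot counts and the way total slots are derived (issue width times cycles, per the description above) are illustrative; the actual event names appear in the tables that follow.

```c
#include <stdint.h>
#include <stdio.h>

int main(void) {
    /* Hypothetical readings for a 4-wide implementation. */
    uint64_t issue_width = 4;         /* N: issue slots per cycle      */
    uint64_t cycles      = 1000000;   /* cycles elapsed during the run */
    uint64_t fe_bound    = 1200000;   /* Frontend Bound slots          */
    uint64_t be_bound    = 1500000;   /* Backend Bound slots           */
    uint64_t bad_spec    =  300000;   /* Bad Speculation slots         */
    uint64_t retiring    = 1000000;   /* Retiring slots                */

    double total_slots = (double)issue_width * cycles;
    printf("Frontend Bound:  %.1f%%\n", 100.0 * fe_bound / total_slots);
    printf("Backend Bound:   %.1f%%\n", 100.0 * be_bound / total_slots);
    printf("Bad Speculation: %.1f%%\n", 100.0 * bad_spec / total_slots);
    printf("Retiring:        %.1f%%\n", 100.0 * retiring / total_slots);
    return 0;
}
```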