This document is an attempt to collect knowledge of computer architecture simulation practices that can be used for program analysis, hardware design, and next-generation development of computer systems.
One of the difficulties in computer architecture simulation is the need to speed up very long simulation times. Over the past few decades, researchers and practitioners have used a myriad of techniques to accomplish this goal, but we rarely document both the successes and the limitations of those techniques. A technique can be applicable to some workload types and not others, and some methodologies perform better for specific classes of applications. It is this set of trade-offs that we aim to present in this document, in addition to links to complete tutorials and use cases. We hope that as these methodologies improve over time, we can continue to augment this document so that it becomes a way to bring the collective knowledge of years of computer architecture research into a single place.
Computer architecture at a high level can be described as "the internal organization of a computer in an abstract way; that is, it defines the capabilities of the computer and its programming model." [1] It also includes the Instruction Set Architecture (ISA) design, microarchitecture design, and logic design and implementation [2], [3].
Timing simulation of next-generation computer systems allows us to understand the performance of a new architecture that exists or has yet to be built. This includes the timing of critical components, and tends to be at the cycle level, but does not need to simulate down to the transistor or logic layers (although using logic synthesis and layout design flows can help to provide more accurate energy, power, and area analysis). Timing simulation is different from architecture (or ISA) emulation. Emulating a computer system typically refers to functionally modeling the ISA of that system, which can be done with popular emulators like QEMU.
There are a large number of computer architecture simulators in use today. Below is a non-exhaustive list of some recent simulators.
- CPU Simulators
- GPU Simulators
- Heterogeneous Simulators
  - gem5 APU simulator
  - gem5-gpu
- Full-system Simulators
- Distributed Systems/Network Systems
- RTL Simulators
  - Verilator
  - Icarus Verilog
  - ModelSim
  - XSIM
  - QuestaSim
  - NCSim
  - VCS
  - Riviera-PRO
  - CVC
  - RepCut
  - Parendi
- FPGA-Accelerated Simulators
- Analytical Models
Running modern workloads like the reference (large) inputs of the SPEC CPU2017 benchmarks or the CloudSuite datacenter benchmark suite can take years to simulate to completion, as detailed simulators run at least 10,000 times slower than native execution.
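To make the scale concrete, here is a back-of-the-envelope estimate (the numbers are assumed orders of magnitude, not measurements): a single SPEC CPU2017 reference input can execute on the order of ten trillion instructions, while a detailed simulator might sustain roughly 100 KIPS.

```python
# Back-of-the-envelope simulation-time estimate (illustrative numbers only).
native_instructions = 10e12      # ~10 trillion instructions: assumed order of
                                 # magnitude for a SPEC CPU2017 reference input
simulator_ips = 100e3            # ~100 KIPS detailed simulation speed (assumed)

seconds = native_instructions / simulator_ips
years = seconds / (365 * 24 * 3600)
print(f"{seconds:.2e} s  ~=  {years:.1f} years")
```

Even generous assumptions about simulator speed leave the detailed simulation of a full reference input measured in years, which motivates the techniques below.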
It is possible to generate small, synthetic workloads that take significantly less time to simulate, as has been done in several previous works. As with all methodologies, there are trade-offs in using them. MAMPO is a multithreaded synthetic power virus generation framework targeting multicore systems. It uses a genetic algorithm to search for the best power virus for a given multicore system configuration. SynchroTrace is a trace-based multi-threaded simulation methodology that accurately replays synchronization- and dependency-aware traces for chip multiprocessor systems. SynchroTrace achieves this by recording synchronization events and dependencies in the traces, allowing for replay on different hardware platforms. GPGPU-MiniBench captures the execution behavior of existing GPGPU workloads in a profile, which includes a divergence flow statistics graph (DFSG) to characterize the dynamic control flow behavior of a GPGPU kernel, and then generates a synthetic miniature GPGPU kernel that exhibits similar execution characteristics to the original workload. G-MAP statistically models the locality of the GPU memory access stream, considering the regularity in code-localized memory access patterns and the parallelism in the execution model, to create miniaturized memory proxies. Mystique generates benchmarks from production AI models by leveraging PyTorch execution traces, and Ditto is another recent work that synthesizes workloads for data centers. While these techniques can do well at mimicking traditional CPU performance behaviors, like branch mispredictions, cache miss rates, and IPC, synthetic workloads might not be applicable to all workload studies, such as cache compression or prefetching.
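As a minimal illustration of the general idea behind statistical workload synthesis (not the algorithm of any specific tool above), one can profile a stride histogram from a real memory access stream and then emit a short synthetic stream that reproduces the same stride distribution; all names and numbers here are illustrative.

```python
import random
from collections import Counter

def stride_histogram(addresses):
    """Profile the distribution of address deltas (strides) in a trace."""
    strides = [b - a for a, b in zip(addresses, addresses[1:])]
    total = len(strides)
    return {s: n / total for s, n in Counter(strides).items()}

def synthesize(hist, length, start=0x1000, seed=42):
    """Emit a synthetic stream whose stride distribution matches `hist`."""
    rng = random.Random(seed)
    strides, weights = zip(*hist.items())
    addr, out = start, [start]
    for _ in range(length - 1):
        addr += rng.choices(strides, weights)[0]
        out.append(addr)
    return out

# Example: a mostly unit-stride stream (64 B lines) with one page-sized jump.
real = [i * 64 for i in range(1000)]
real[500:] = [a + 4096 for a in real[500:]]
hist = stride_histogram(real)
proxy = synthesize(hist, 100)   # far shorter, statistically similar stream
```

The synthetic stream can preserve statistics like miss rates for simple cache studies, but, as noted above, it discards exact address contents and ordering, which is why such proxies can mislead studies of, e.g., cache compression.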
Instead of generating smaller, synthetic workloads, it is possible to simulate only a representative portion of the actual application's execution, allowing the original workload to be simulated much more quickly. Two of the most prevalent methodologies are SMARTS and SimPoint, which are based on statistical sampling and region similarity detection, respectively.
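The statistical-sampling side can be sketched as follows: measure IPC in many small detailed windows and estimate whole-program IPC with a confidence interval. This is a deliberate simplification; SMARTS proper uses systematic (periodic) sampling with functional warming between detailed windows.

```python
import math
import random

def sample_ipc(ipc_trace, n_samples, seed=0):
    """Estimate whole-program IPC from randomly placed detailed windows.

    Simplified sketch: SMARTS uses periodic (systematic) sampling and
    functional warm-up between the detailed windows.
    """
    rng = random.Random(seed)
    samples = rng.sample(ipc_trace, n_samples)
    mean = sum(samples) / n_samples
    var = sum((x - mean) ** 2 for x in samples) / (n_samples - 1)
    # 95% confidence interval, assuming an approximately normal sample mean
    half_width = 1.96 * math.sqrt(var / n_samples)
    return mean, half_width

# Synthetic per-window IPC trace with two program phases (illustrative).
trace = [1.2] * 5000 + [0.6] * 5000
mean, ci = sample_ipc(trace, 200)
print(f"IPC = {mean:.3f} +/- {ci:.3f}")
```

Simulating 200 windows instead of 10,000 gives a tight program-wide IPC estimate, but, as noted below, only the aggregate: the per-window IPC trace itself is not recovered.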
SimPoint employed Basic Block Vectors (BBVs) as distinctive signatures to represent instruction streams of fixed or variable length intervals, leveraging the principle that similar workloads should traverse comparable sequences of basic blocks. However, the SimPoint approach failed to consider performance disparities between similar instruction streams arising from micro-architectural and hardware variations, such as cache states and frequency variations, and also neglected the effects of thread interactions in multi-threaded scenarios. SMARTS introduced a systematic sampling framework that simulated programs by alternating between fast-forward, warm-up, and detailed simulation phases, generating IPC samples for each detailed simulation. This approach enabled highly accurate estimation of program-wide IPC through statistical methods. Nonetheless, the scope of SMARTS was limited to estimating overall IPC and could not provide granular IPC traces throughout program execution. LiveSim extended the statistical sampling framework by incorporating in-memory checkpoints at sample regions, facilitating interactive simulations. Similar to SMARTS, LiveSim focused on IPC estimation with confidence levels but lacked the ability to generate detailed IPC traces. pFSA employed hardware virtualization to spawn processes that could fast-forward to regions of interest (ROIs) at near-native speeds and execute detailed simulations concurrently. Additionally, pFSA introduced a novel cache warm-up technique based on estimating the error introduced by insufficient cache warming.
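A minimal sketch of the BBV-and-clustering idea follows (simplified: SimPoint additionally applies random projection to reduce BBV dimensionality and uses the BIC to choose the number of clusters).

```python
def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def bbv(interval, n_blocks):
    """Basic Block Vector: normalized execution counts of each basic block
    within one fixed-length instruction interval."""
    v = [0.0] * n_blocks
    for block_id in interval:
        v[block_id] += 1
    total = sum(v)
    return [c / total for c in v]

def pick_simpoints(vectors, k, iters=10):
    """Cluster BBVs and return, for each cluster, the index of the interval
    closest to the centroid (the representative 'simpoint')."""
    # Farthest-point initialization keeps this sketch deterministic.
    centers = [vectors[0]]
    while len(centers) < k:
        centers.append(max(vectors,
                           key=lambda v: min(dist2(v, c) for c in centers)))
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for idx, v in enumerate(vectors):
            groups[min(range(k), key=lambda c: dist2(v, centers[c]))].append(idx)
        centers = [[sum(vectors[i][d] for i in g) / len(g)
                    for d in range(len(vectors[0]))] if g else centers[ci]
                   for ci, g in enumerate(groups)]
    return [min(g, key=lambda i: dist2(vectors[i], centers[ci]))
            for ci, g in enumerate(groups) if g]

# Two synthetic phases: intervals dominated by blocks 0-1 vs. blocks 2-3.
intervals = [[0, 1] * 50] * 10 + [[2, 3] * 50] * 10
vectors = [bbv(iv, 4) for iv in intervals]
simpoints = pick_simpoints(vectors, k=2)   # one interval per detected phase
```

Only the chosen intervals are then simulated in detail, with each cluster's size providing the weight for reconstructing whole-program metrics.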
SimFlex introduced a multi-threaded sampling technique, building upon the SMARTS methodology, by strategically sampling processors executing the program's critical path. COTSon extended the simulation scope to encompass the entire software stack and hardware models, ensuring both high performance and accuracy. Time-based sampling methodologies pioneered a generic simulation framework for multi-threaded applications, employing time progression rather than instruction count. However, this approach was constrained by the execution time of the program, limiting its ability to identify program structures. Neither SimFlex nor Time-based Sampling leveraged software-specific knowledge, such as barriers, tasks, and loops, to inform the decomposition of programs into representative regions. BarrierPoint and TaskPoint are based on the structural characteristics of multi-threaded programs, utilizing barrier synchronization primitives and tasks as units of work, respectively. This approach enabled the automatic identification of software regularities based on the inherent synchronization primitives. Nonetheless, these methods were restricted by their dependence on specific programming paradigms, limiting their general applicability. LoopPoint proposed a generic profiling-based sampling methodology that eschewed assumptions about software style, employing loop boundaries as a heuristic for delineating representative regions. However, this heuristic failed to exploit the hierarchical nature of programs and lacked support for dynamic software and hardware changes. Pac-Sim employs live profiling and region classification to enable the sampled simulation of multi-threaded applications in the presence of hardware and software dynamism.
Kambadur et al. introduced a sampling solution for Intel GPU workloads building upon GTPin, employing kernel names, arguments, and basic block entries to identify representative regions within GPU programs at a kernel-level granularity. TBPoint utilized BBVs and other kernel-specific features to pinpoint representative kernels, while Principal Kernel Analysis (PKA) monitored IPC variations between sampling units to identify regions suitable for fast-forwarding. Both TBPoint and PKA facilitated the sampled simulation of GPU workloads at both inter- and intra-kernel levels. Sieve extended previous research, demonstrating that combining kernel names and instruction counts led to more effective sample selection. Photon employed GPU BBVs for both inter- and intra-kernel workload sampling, resulting in a substantial enhancement in sampling accuracy compared to prior methods.
The table below outlines the sampled simulation methodologies and their applicability.
Methodology | Analysis Type | Parallel Simulation | Warmup | Applicability/Workloads |
---|---|---|---|---|
SimPoint | ▣ | ✔️ | Prev Region | Single-threaded CPU |
SMARTS | □ | ❌️ | Functional | Single-threaded CPU |
LiveSim | ▣ | ✔️ | Checkpoint | Single-threaded CPU |
SimFlex | □ | ❌️ | Checkpoint | Multi-program CPU |
Time-based sampling | □ | ❌️ | Functional | Multi-threaded CPU |
BarrierPoint | ▣ | ✔️ | Prev Region | Multi-threaded CPU |
TaskPoint | ▣ | ✔️ | Prev Region | Task-based CPU |
LoopPoint | ▣ | ✔️ | Prev Region | Multi-threaded CPU |
Pac-Sim | ● | ✔️ | Statistical | Multi-threaded CPU |
▣ Profile-driven analysis □ Statistical analysis ● Online profiling
Checkpoints allow for state exchange among multiple simulators, leveraging the strengths of each. For instance, in sampled simulation, a fast warming simulator that samples workloads can export checkpoints to a detailed simulator, which, though slower, provides precise performance data. This approach enhances simulation speed while maintaining accuracy.
Checkpoints can be categorized based on the type of state they save:
- Architectural checkpoint: Captures the architectural (software-visible) state, including the register files of cores, memory, and I/O device states. Popular emulators like QEMU, Simics, gem5, and VM hypervisors (e.g., KVM) maintain these states.
- Microarchitectural checkpoint: Captures the states of microarchitectural components such as pipelines, caches, branch predictors, TLBs, on-chip network virtual channels, and DRAM controllers.
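A minimal sketch of what an architectural checkpoint might contain and how one simulator could export it for another to resume from follows; the format and field names are illustrative, not those of any particular simulator.

```python
import json
import zlib

def save_checkpoint(path, regs, memory, io_state):
    """Serialize the software-visible state: register files, memory, I/O.
    Microarchitectural state (caches, predictors, ...) is deliberately
    absent -- the consuming simulator must warm it up, or restore it from
    a separate microarchitectural checkpoint."""
    state = {"regs": regs, "memory": memory.hex(), "io": io_state}
    with open(path, "wb") as f:
        f.write(zlib.compress(json.dumps(state).encode()))  # compressed, as
                                                            # real tools do

def load_checkpoint(path):
    with open(path, "rb") as f:
        state = json.loads(zlib.decompress(f.read()))
    state["memory"] = bytes.fromhex(state["memory"])
    return state

# A fast functional simulator exports state; a detailed one resumes from it.
save_checkpoint("cpu0.ckpt",
                regs={"pc": 0x400000, "x1": 7},
                memory=b"\x00" * 64,
                io_state={"uart_tx": ""})
resumed = load_checkpoint("cpu0.ckpt")
```

The detailed simulator that loads this state gets an architecturally correct starting point but cold microarchitectural structures, which is exactly the warm-up problem the techniques below address.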
Research on checkpoints focuses on storage and adaptability. Storage solutions include compression (e.g., QEMU's incremental disk checkpoints and Simics) and pruning techniques like Livepoints, which store only the state needed for a following short, detailed simulation window. Statistical profiles, such as MRRL and BLRL, store reuse distribution data to save space and can be extrapolated naturally.
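The idea behind such reuse-based profiles can be sketched as follows (a simplification of what MRRL/BLRL actually store): record how long ago each referenced cache line was last touched, then pick the warm-up history length that covers most reuses before a sample.

```python
from collections import Counter

def reuse_latency_profile(trace):
    """For each memory reference, record how many accesses ago its cache
    line was last touched (its reuse latency); cold lines are skipped."""
    last_seen = {}
    hist = Counter()
    for t, line in enumerate(trace):
        if line in last_seen:
            hist[t - last_seen[line]] += 1
        last_seen[line] = t
    return hist

def warmup_length(hist, coverage=0.99):
    """Smallest history length covering `coverage` of all observed reuses;
    warming that many accesses before a sample captures most reuse pairs."""
    total = sum(hist.values())
    seen = 0
    for d in sorted(hist):
        seen += hist[d]
        if seen >= coverage * total:
            return d
    return max(hist) if hist else 0

# Illustrative trace: three hot lines plus an occasionally reused line 3.
trace = [0, 1, 2, 0, 1, 2, 3, 0, 1, 2] * 10
need = warmup_length(reuse_latency_profile(trace))
```

The histogram is tiny compared to full cache contents, and because it describes the access stream rather than a fixed cache, the same profile serves any cache size.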
Adaptability is achieved by storing metadata as hints for post-processing. For example, Memory Timestamp Record (MTR) tracks each core's last read/write timestamp for every cache line, making it compatible with cache hierarchies that use any write-invalidate coherence state and LRU replacement policy. The aforementioned statistical profiles can also be naturally extrapolated to any cache capacity.
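The kind of adaptability MTR enables can be sketched as follows (simplified: real MTR also reconstructs coherence state from separate per-core read and write timestamps): given only each line's last-access timestamp, the resident contents of an LRU cache of any geometry can be rebuilt at restore time.

```python
def reconstruct_cache(last_access, num_sets, ways, line_bytes=64):
    """Rebuild the resident lines of an LRU set-associative cache from a
    record of line address -> last-access timestamp.  Because only
    timestamps are stored, the same record can populate caches of any
    associativity or set count."""
    sets = [[] for _ in range(num_sets)]
    for addr, ts in last_access.items():
        sets[(addr // line_bytes) % num_sets].append((ts, addr))
    # Under LRU, the `ways` most recently touched lines of a set are resident.
    return [[addr for ts, addr in sorted(s, reverse=True)[:ways]]
            for s in sets]

# Timestamps recorded during fast-forwarding (illustrative values).
last_access = {0x0000: 1, 0x1000: 5, 0x2000: 9, 0x3000: 2, 0x0040: 7}
warm2 = reconstruct_cache(last_access, num_sets=64, ways=2)   # 2-way cache
warm1 = reconstruct_cache(last_access, num_sets=64, ways=1)   # direct-mapped
```

One profiling pass thus warms caches of different capacities and associativities, which is the post-processing adaptability described above.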