Highly variable per-core performance in a 16-core simulation #1297
dirkcgrunwald asked this question in Q&A (unanswered)
I'm running some simple SE-mode simulations of a slightly modified MESI_Two_Level setup using gem5 v23.1. The system uses the BaseO3CPU, a two-level Ruby cache hierarchy, and four DRAM controllers with two channels each, connected with the MeshDirCorners_XY topology. The Ruby DOT plot shows the L1$ and L2$ each connected directly to the routers (i.e., not L1 <-> L2 <-> network), and the corner nodes also have the DRAM mem_cntrls attached. I'm using se.py to configure and launch the gem5.opt run.
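The invocation is roughly the following, wrapped here in a small Python driver just so the flags are in one place. The workload binary name and several option values are placeholders rather than my exact command, and the flag spellings are from memory, not copied verbatim:

```python
# Rough sketch of the launch. "./mountain" and several option values are
# placeholders, and the exact flag spellings are from memory.
import subprocess

workload = ';'.join(['./mountain'] * 16)    # one copy of the app per core

cmd = [
    'build/X86/gem5.opt',                   # default X86 build uses MESI_Two_Level
    'configs/deprecated/example/se.py',     # se.py location in v23.1
    '--num-cpus=16',
    '--cpu-type=X86O3CPU',                  # O3 CPU; exact type name depends on the ISA build
    '--ruby',
    '--network=garnet',
    '--topology=MeshDirCorners_XY',
    '--mesh-rows=4',
    '--num-dirs=4',
    '--mem-channels=2',
    '--cmd', workload,
]
subprocess.run(cmd, check=True)
```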
I'm then running a fairly simple application (a version of the ["memory mountain"](https://csapp.cs.cmu.edu/2e/ics2/code/mem/mountain/mountain.c) with a reduced stride and maximum memory size). The goal is to exercise the memory hierarchy and the DRAM interface. Each of the 16 cores runs its own copy of this purely serial app -- i.e., there's no dependence across the workloads on the different cores.
The results are a little surprising. The plot below shows numCycles for each core, normalized to the lowest value -- the execution time varies by a factor of 5 across cores. I'm not certain why that would be happening. In addition, when running the script, the output from some of the processes appears significantly earlier than from the others (consistent with the factor of 5).
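For reference, this is how I pulled the per-core numbers out of stats.txt (a quick sketch; the exact stat path, e.g. system.cpu0.numCycles vs. some other CPU naming, depends on the config):

```python
# Quick sketch: extract per-core numCycles from m5out/stats.txt and normalize
# to the fastest core. Assumes lines of the form "system.cpuN.numCycles <value>";
# the exact object path depends on how the CPUs are named in the config.
import re

cycles = {}
with open('m5out/stats.txt') as f:
    for line in f:
        m = re.match(r'system\.cpu(\d+)\.numCycles\s+(\d+)', line)
        if m:
            cycles[int(m.group(1))] = int(m.group(2))

fastest = min(cycles.values())
for core in sorted(cycles):
    print(f'cpu{core}: {cycles[core]} cycles '
          f'({cycles[core] / fastest:.2f}x the fastest core)')
```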
Am I misunderstanding how SE mode runs processes? When I examine the DRAM trace, I see a series of functional writes that load the binary, and then demand fetches start to appear. The trace gets very long, so it's hard to follow beyond the first few hundred references, but it looks like all the simulated cores start fetching at about the same time. The workloads all have similar L1/L2 miss rates, and the DRAM references are spread evenly across the DRAM controllers.
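For what it's worth, my mental model of what se.py does with 16 ';'-separated commands in SE mode is roughly the sketch below: one Process per core, each with its own pid and stdout, and no sharing between them. If that model is wrong, it would probably explain what I'm seeing.

```python
# Sketch of how I believe se.py maps the 16 commands onto the CPUs in SE mode.
# "system.cpu" and "./mountain" stand in for the objects/binary in my actual
# config; this fragment lives inside the full configuration script.
from m5.objects import Process

processes = [Process(pid=100 + i,
                     executable='./mountain',
                     cmd=['./mountain'],
                     output=f'mountain.{i}.out')
             for i in range(16)]

for cpu, proc in zip(system.cpu, processes):
    cpu.workload = proc
    cpu.createThreads()
```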