-
Notifications
You must be signed in to change notification settings - Fork 401
Memory Hierarchy, Spatial Locality, and Data Structures
One of the most important performance features of any computer system is the memory hierarchy. High-performance programs make efficient use of the memory hierarchy via algorithms and data structures that take advantage of the structure of the memory hierarchy and avoid its limitations.
The memory hierarchy of a modern computer consists of off-chip "main memory" and multiple levels of on-chip "cache". The reason there is a hierarchy instead of just a single flat structure is that "you can only optimize for two of the three axes at any one time" phenomena that often comes up in engineering. In storage, the three dimensions are latency, bandwidth, and capacity. You can have low latency and high bandwidth, but only with small capacity. You can have high capacity and low latency, but only with low bandwidth. You can have high capacity and high bandwidth but only with high latency. The memory hierarchy is composed of different structures that optimize for different combinations of metrics so that the combination achieves the appearance of us low latency, high bandwidth, and large capacity.
At the top of the hierarchy are the on-chip caches which are composed of low-latency static RAM. There are two or even three levels of on-chip cache. At the top level, there are separate caches for instructions (i.e., code) and data. These caches are typically small (16-64KB) and can be accessed in one clock cycle. The second level (L2) cache mixes instructions and data, is larger (typically 500-1000KB), and is accessible in 10-15 clock cycles. The third level cache is larger still (16MB+) and is accessible in 20-30 clock cycles. The bottom of the hierarchy, "main memory", is some combination of traditional dynamic RAM and non-volatile floating-gate RAM (i.e., "Flash"), its capacity is measured in GBs, and its latency in 100s of clock cycles.
The hierarchy operates on the principle of temporal locality, which says that any code/data that was accessed is likely to be accessed again soon. As code/data is accessed, copies of it are brought in from memory into the upper level caches, displacing code/data that was not recently accessed. Over time, the contents of the caches are (hopefully) such that 90% of the code/data that is accessed is available in the first level caches and accessible in one clock cycle, of the remaining 10% 90% is available in the L2 and accessible in 10-15 cycles, and so on, such that the latency of the average code/data access is close to one cycle. A bit of terminology, when the processor looks for a piece of code/data in a certain cache, does not find it there, and has to go to the next level, that is called a "cache miss". The presence of cache misses is one thing that makes flat instruction counts less than perfectly correlated with observed performance.
Programs exhibit temporal locality relatively naturally--manifestations of the 10/90 rule are everywhere in computing as in life--but the memory hierarchy also exploits a second form of locality. Spatial locality says that any code/data that is near (in terms of memory address) another piece of code/data that was recently accessed is likely to be accessed soon. For code, this is fairly obvious--code instructions are layed out sequentially in memory--but data exhibits spatial locality too. Adjacent members in a struct/class have adjacent addresses and if you access one you are likely to access another. Adjacent elements in a (scalar) array are also adjacent in memory and a routine that accesses array[i]
is likely to access array[i+1]
also. To exploit spatial locality, caches and main memory are arranged into blocks. In modern processors, cache blocks are typically 64B (bytes). On a cache miss, a cache pulls in not only the scalar element the processor immediately asked for (a scalar elements is 1-8 bytes) but the 64-byte chunk surrounding the element, i.e., seven other elements. Memory blocks are managed in larger chunks called pages. On modern systems, pages are 4KB (kilobytes).
High performance programs use algorithms and data structures that maximize temporal and spatial locality.
If you've read the short tutorial/think-piece on containers, you know that arrays are the name of the game. Dynamic pointer-based data structures like binary trees (std::map
) and hash tables (std::unordered_map
) do have their uses, but most of the action is in arrays especially in data and computation intensive programs like EnergyPlus. By the way when I say array in this context, I don't mean just std::array, I mean any container that stores a contiguous collection of elements of the same type including std::vector
, EPVector
, ObjexxFCL::Array1D
and all of the other ObjexxFCL arrays. including But arrays are not the only important data structure, the other important structure is ... well ... the structure, i.e., the struct
or class
. 99% of the data in EnergyPlus is organized in some combination of array and structure. Arrays of scalars, arrays of structures, arrays of arrays of structures, structures that include arrays, arrays of structures that include arrays, etc.
The basic question is "array of scalars" vs. "array of structures". Consider the various fields associated with window frames. Is it better to represent these as a struct and then have a single array of struct's indexed by surface number?
struct WinFrameType {
Real64 Area;
Real64 Conductance;
Real64 SolAbsorp;
Real64 VisAbsorp;
Real64 Emis;
Real64 EdgeToCenterGlassCondRatio;
Real64 EdgeArea;
Real64 TempIn;
Real64 TempInOld;
Real64 TempSurfOut;
};
Array1D<struct WinFrameType> SurfWinFrames;
Or is it better to represent these as individual arrays?
Array1D<Real64> SurfWinFrameArea;
Array1D<Real64> SurfWinFrameConductance;
Array1D<Real64> SurfWinFrameSolAbsorp;
Array1D<Real64> SurfWinFrameVisAbsorp;
Array1D<Real64> SurfWinFrameEmis;
Array1D<Real64> SurfWinFrEdgeToCenterGlCondRatio; // Ignore name irregularity
Array1D<Real64> SurfWinFrameEdgeArea;
Array1D<Real64> SurfWinFrameTempIn;
Array1D<Real64> SurfWinFrameTempInOld;
Array1D<Real64> SurfWinFrameTempSurfOut;
The former is somewhat less verbose and more "object-oriented" but does the second have better spatial locality? It depends on the access pattern of course. The two main questions are:
- When one of these fields is accessed for a given window, are all of the other fields also accessed for the same window?
- When one of these fields is accessed for a given window, will the same field also be accessed for the next window?
And here are the combination of answers:
- If the answer to the two questions are no/no, then there is no locality at all and we need a different organization entirely or it doesn't matter. However this is not likely to be the case.
- If the answers are yes/no, then the array-of-structure formulation is preferred. This is a possibility.
- If the answers are no/yes, then the arrays-of-scalars formulation is preferred. This is also a possibility, although an unlikely one in this case since all of the fields are related to window frames.
- If the answers are yes/yes, then both formulations have locality although the latter is a bit more "vectorization friendly" (see tutorial on vectorization friendly code here). Although vectorization friendliness has to be weighed against object-oriented programming considerations here. Vectorization is finicky, a lot of things have to be "just so" for the compiler to vectorize effectively. The compiler may not be able to vectorize for any number of reasons and choosing a vectorization friendly memory layout may be for naught.
The above discussion may lead you to believe that gathering fields into structures and using arrays-of-structures is the way to go. But there is a limit to everything. Should all fields associated with windows be collected into a single WinType? Well, care to guess how many window related fields there are in EnergyPlus? The answer is 385. Most of these are 8B integers or Real64's and some are even arrays. If all of these fields were collected into a single structure, the size of the structure would be about 3KB or most of a full memory page. If a given section of code only deals with a few of these fields, it is quite a waste to have reserved 4KB in memory only to use 1% or less of it, that type of utilization makes for poor spatial locality. Of course, utilization in the caches will be higher than that because cache blocks are only 64B, but it may still not be great. And spatial locality in main memory also matters.
What is the point of all of this? What are some rules of thumb for design? Well here are a couple:
- Very large structures that contain many fields that are not likely to be accessed within the same functions or the same loops are not a good idea. For spatial locality, and maybe even for general programming hygiene, it is better to break such large structures down into smaller structures whose fields display greater spatial locality with one another.
- If certain fields are commonly used with multiple other groups of fields, then those should also be grouped separately so that they can be dynamically cached with several different field groups.
- Vector code benefits from the arrays-of-scalars formulation. Processors do not understand the abstraction of structures, but their vector units are taylor-made for arrays-of-scalars. If you have a loop that vectorizes, but in a way that generates additional code to gather fields from structures, consider reformulating the data structures used in that loop to arrays-of-scalars instead.