
[Hot Cold Splitting] Redesign the hot cold mapping data structure to allow fast lookup #1903

Open
cshung opened this issue Jun 18, 2022 · 12 comments
Labels
area-hot-cold-splitting Hot cold splitting support in crossgen2

Comments

@cshung
Member

cshung commented Jun 18, 2022

Right now, the data structure is simply a list of (cold runtime function index, hot runtime function index) pairs, which is not efficient for lookup. We need to redesign it so that lookups are fast.

@cshung cshung added the area-hot-cold-splitting Hot cold splitting support in crossgen2 label Jun 22, 2022
@amanasifkhalid
Member

Design Proposal

Current Implementation

@cshung's initial prototype for hot/cold splitting in Crossgen2 introduces a new data structure mapping indices of cold runtime functions to indices of their hot counterparts in the runtime function table. This mapping (currently referred to as Scratch in the codebase) is implemented as a list of integer indices: Mapped cold/hot integer indices are inserted into Scratch in pairs. This creates some important invariants:

  • Function indices at even indices in Scratch point to cold functions, while function indices at odd indices point to hot functions.
    • In the prototype codebase, you will see checks like if (i % 2 == 0) to check if Scratch[i] is the index of a cold function, etc.
  • The index of a cold function is immediately succeeded by the index of its hot counterpart in Scratch.
    • This facilitates obtaining the hot/cold counterpart of a cold/hot function, given its index in Scratch.

Scratch is written to file as a linear collection of unsigned integer indices, each one 4 bytes long.
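
For concreteness, here is a minimal sketch of what a lookup against this pair list entails. The names are hypothetical and the prototype's actual code differs; the point is the linear scan:

```csharp
// Hypothetical sketch of a lookup against the Scratch pair list.
// Even slots hold cold runtime function indices; odd slots hold the
// hot counterpart of the preceding cold index.
static int? FindCounterpart(uint[] scratch, uint functionIndex)
{
    for (int i = 0; i < scratch.Length; i++)
    {
        if (scratch[i] == functionIndex)
        {
            // A cold entry (even slot) is followed by its hot counterpart;
            // a hot entry (odd slot) is preceded by its cold counterpart.
            return (int)((i % 2 == 0) ? scratch[i + 1] : scratch[i - 1]);
        }
    }
    return null; // Function is not split.
}
```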

Problems

  • When writing to the R2R format, space is critical. If N functions are split into hot/cold sections, Scratch will contain 2N indices. If each index is 4 bytes long, we will write 8N bytes to file, plus potentially any extra data needed by Scratch. Using a more succinct data structure, we can approach the theoretical lower limit for the amount of memory required to represent this hot/cold function mapping.
  • Furthermore, searching Scratch at runtime requires a linear search. Ideally, we want a dictionary-like data structure that achieves constant-time lookup; if this takes too much space, a data structure capable of binary search would still be an improvement.

Proposal

We propose replacing Scratch with a succinct rank/select dictionary. Using a succinct dictionary would enable constant-time lookup while significantly reducing the amount of data written to file. Consider the following implementation:

At compile-time,

  • We initialize a bit vector whose length equals the total number of functions in the runtime function table. We then scan the table for pairs of hot/cold-split functions: if the ith and jth functions in the table are a hot/cold pair, we set the ith and jth bits in the bit vector to 1. If the ith function in the table is not part of any hot/cold pair, we set the ith bit in the bit vector to 0. These bits are written to file.
  • While scanning the runtime function table, we note the number of split functions and write this to file as well (see the sketch after this list).
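
For illustration, a minimal sketch of this compile-time step, assuming the compiler already knows the hot/cold index pairs (names here are hypothetical):

```csharp
// Build the split-function bit vector from known (hot, cold) index pairs.
// totalFunctions is the runtime function table size.
static (bool[] bits, int numSplitFunc) BuildBitVector(
    int totalFunctions, (int hot, int cold)[] pairs)
{
    bool[] bits = new bool[totalFunctions]; // bits default to 0 (not split)
    foreach ((int hot, int cold) in pairs)
    {
        bits[hot] = true;   // mark the hot half of the pair
        bits[cold] = true;  // mark the cold half of the pair
    }
    return (bits, pairs.Length); // the pair count is also written to file
}
```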

At runtime,

  • We load the bit vector and the number of split functions, perhaps as members of a dedicated class.
  • Using the bit vector, we generate rank and select tables.
    • A rank table, in this context, returns the number of bits in the bit vector equal to 1 up to a given index, inclusive. More formally, rank[x] = |{0 <= k <= x : B[k] = 1}|, where B is the bit vector.
    • A select table, given an integer x, returns the index of the xth 1 in the bit vector. More formally,
      select[x] = min{0 <= k < N : rank[k] = x}, where N is the size of the bit vector.

Given a function index x, we can perform a lookup via the following algorithm (a code sketch follows the list):

  • Check the xth bit of the bit vector. If it is 0, we know the function is not split, and thus does not have a pair index. Else if it is 1, proceed.
  • Look up rank[x]. Let num_split_func be the number of split functions. Determine select_index:
    • If rank[x] > num_split_func, then x is the index of a cold function, so we are looking up the index of a hot function. Calculate select_index = rank[x] - num_split_func.
    • If rank[x] <= num_split_func, then x is the index of a hot function, so we are looking up the index of a cold function. Calculate select_index = rank[x] + num_split_func.
  • Calculate target_index = select[select_index]. This is the index of the corresponding hot/cold function.
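
As a concrete illustration, here is the lookup with explicit rank/select arrays materialized from the bit vector (purely illustrative; later comments discuss avoiding building these at runtime, and all names are hypothetical):

```csharp
// Illustrative lookup using explicit rank/select tables.
// bitVector[i] is true iff runtime function i is part of a hot/cold pair.
// Returns the counterpart's index, or null if function x is not split.
static int? LookupCounterpart(bool[] bitVector, int numSplitFunc, int x)
{
    int n = bitVector.Length;
    int[] rank = new int[n];                      // rank[i] = # of 1s in [0..i]
    int[] select = new int[2 * numSplitFunc + 1]; // select[r] = index of r-th 1
    for (int i = 0, ones = 0; i < n; i++)
    {
        if (bitVector[i])
            select[++ones] = i;
        rank[i] = ones;
    }

    if (!bitVector[x])
        return null; // not split, no pair index

    // Hot functions occupy ranks 1..numSplitFunc; cold functions follow.
    int selectIndex = (rank[x] > numSplitFunc)
        ? rank[x] - numSplitFunc  // x is cold; jump back to its hot half
        : rank[x] + numSplitFunc; // x is hot; jump forward to its cold half
    return select[selectIndex];
}
```

Running this against the walkthrough below (bit vector [1, 0, 1, 1, 1], numSplitFunc = 2) returns 3 for x = 0 and 2 for x = 4.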

Example Walkthrough

Suppose the runtime function table has 5 entries with the following order:

0 | hot1
1 | hot2
2 | hot3
3 | cold1
4 | cold3

Functions hot1 and hot3 are split, so indices 0, 2, 3, and 4 in the bit vector are set to 1. We also note that num_split_func = 2. Below is the bit vector:
[1, 0, 1, 1, 1]

We then calculate the rank and select tables as follows:
rank = [1, 1, 2, 3, 4]
select = [X, 0, 2, 3, 4]

Now, suppose we want to find the corresponding cold function to hot1, which has an index of 0 in the runtime function table.

  • The 0th bit in the bit vector is 1, so we know such a pair exists.
  • rank[0] = 1 <= num_split_func. We calculate select_index = rank[0] + num_split_func = 1 + 2 = 3.
  • select[select_index] = select[3] = target_index = 3.
  • The function at index 3 in the runtime function table is cold1, confirming the search's correctness.

Let's try going the other way. Suppose we want to find the corresponding hot function to cold3, which has an index of 4 in the runtime function table.

  • The 4th bit in the bit vector is 1, so we know such a pair exists.
  • rank[4] = 4 > num_split_func. We calculate select_index = rank[4] - num_split_func = 4 - 2 = 2.
  • select[select_index] = select[2] = target_index = 2.
  • The function at index 2 in the runtime function table is hot3, confirming the search's correctness.

Performance

For each function in the runtime function table, we write one bit to file. Thus, for a runtime function table with N functions, N bits are written to file, plus a constant number of bits for the number of split functions; if we place an arbitrary limit on the number of functions allowed in an R2R file (2^32 seems unrealistically high), this constant can be less than 32 bits. Assuming a single bit per function is the information-theoretic lower bound, the data structure written to file is implicit, as it consists of N bits plus some constant.

At runtime, additional space complexity is introduced by generating the rank and select tables (structures whose sizes are linear in the size of the bit vector). However, the initial cost of creating these tables will be quickly offset by the search algorithm's constant time complexity.

Limitations

  • This approach assumes each function can be split at most once. This is true for now, but if we plan to support splitting functions in several places in the near future, this design may not have longevity.

Upon approval, @EugenioPena and @amanasifkhalid will lead implementation.

@jkotas @trylek @davidwrighton PTAL. Thank you!

@jkotas
Member

jkotas commented Jun 29, 2022

Are there any alternative designs that you have considered and rejected? It would be useful to see sketches of a few alternatives and why this one was chosen as the winner. Have you looked into leveraging information in the existing unwind and gcinfo?

The lookups may be expensive if there is only a small number of hot/cold split functions or if there are large clusters of functions that are not split. The lookups do not have to be super fast, but potentially scanning nearly all methods in the image may be too slow.

This approach assumes each function can be split at most once

This is a fine assumption to make.

@amanasifkhalid
Member

As far as succinct data structures go, I do not know of any others that enable this mapping behavior. I have briefly considered more traditional structures that could enable constant-time lookup on average, but the additional space they would require makes them impractical for writing to file. I haven't looked into the unwind/GC info blobs in the runtime function table, but if hot/cold function mappings can be inferred from there, we could skip creating a new data structure altogether and get constant-time lookup in the runtime function table. I believe @cshung didn't find enough information there to do so, though.

To somewhat lessen the cost of lookups, we could cache the last search result -- this cache would be at most 8 bytes at runtime. Would scanning all methods to set up the bit-vector and rank/select tables be too slow if it is only done once?

@BruceForstall
Member

Is this equivalent to (1) eliminating all non-split hot functions from the list, and (2) for hot function i (in the compressed list), the cold index is i + num_split_func, and for cold function i, the hot index is i - num_split_func?

Related: does it require the hot and cold functions (in the "runtime function table") to be sorted equivalently? i.e., hot1/hot2/hot3/cold3/cold1 would be an illegal ordering? What are the constraints on the ordering of the runtime functions, and how does that relate to the constraints on the layout of the cold code section? The layout of the cold function fragments or cold functions might not matter (if they are truly cold). But if there is an ordering constraint implied by this design I wonder if we need to be careful in case we want to eventually have, say, a "definitely cold" and "probably cold" region, depending on how certain we are of the "coldness" of a function or function fragment.

So, if the bit vector always had two bits for every function (one for hot, one for cold), then you wouldn't need to build up the rank/select data at runtime, right?

@amanasifkhalid
Member

Your implementation sounds valid, but I don't think we can guarantee the new data structure's indices map 1:1 to the runtime function table's indices; for example, if RuntimeFunctionTable[0] is not split but RuntimeFunctionTable[1] is split, then RuntimeFunctionTable[1] would map to index 0 in the data structure, etc. Thus we would lose constant-time lookup.

Yes, this data structure relies on the current order of the runtime function table. If I understand correctly, the runtime function table is sorted by each function's beginning RVA, so the table reflects the order of the file. So hot functions come first in order, followed by their cold counterparts in the same order (i.e. if hot1 and hot3 are both split, then cold1 comes before cold3). Should this order change, the data structure would need to change, too.

So, if the bit vector always had two bits for every function (one for hot, one for cold), then you wouldn't need to build up the rank/select data at runtime, right?

Sorry, I'm not sure if I follow this. What would each bit indicate for a function?

@BruceForstall
Member

What would each bit indicate for a function?

I realize I was neglecting the most important thing here, which is that we have a packed, sorted-by-RVA array of RUNTIME_FUNCTION entries, and we are actually mapping between the hot and cold RUNTIME_FUNCTIONs (correct?). So we can't pad the RUNTIME_FUNCTION table to match a "padded" mapping.

Note, however, that a function or hot/cold function part can have more than one RUNTIME_FUNCTION (see IsFunctionFragment()).

@amanasifkhalid
Member

we are actually mapping between the hot and cold RUNTIME_FUNCTION (correct?)

Yes, that's correct.

Note, however, that a function or hot/cold function part can have more than one RUNTIME_FUNCTION

Interesting, thank you for bringing this up. In the context of, say, unwinding, if we are in a cold RUNTIME_FUNCTION and need to get to its corresponding hot RUNTIME_FUNCTION, there will always be only one correct RUNTIME_FUNCTION to jump to, right? As long as we can make pairs of RUNTIME_FUNCTIONs (I apologize, as I likely meant RUNTIME_FUNCTION when previously saying just "function"), I think this approach should still work.

@amanasifkhalid
Member

amanasifkhalid commented Jun 29, 2022

Andrew just clarified with me that we don't actually intend to create the rank/select arrays at runtime or compile-time (those arrays were more for facilitating the explanation of the algorithm). Rather, we can do some operations on the bit-vector itself to calculate rank/select, as explained by Andrew below:

Rank

To support rank, we divide the bit vector into chunks of equal size. For each chunk, we store the rank of the last bit of the previous chunk (for the first chunk, we can simply skip this because the rank starts at 0). For each bit, we can easily figure out which chunk it is in (because the chunk size is constant).
Next, we divide each chunk into subchunks of equal size and do the same, so that we can find which subchunk we are in.
The subchunks are small, so we can simply use a table lookup to find the answer.
To put this into an example, here is a bit vector in hex:
FACE
Let's say we decided to have an 8-bit chunk and a 4-bit subchunk. We have these data:
Chunk #0: FA
Subchunk #0.0: F
Subchunk #0.1: A (has rank 4)
Chunk #1: CE (has rank 6)
Subchunk #1.0: C
Subchunk #1.1: E (has rank 2)
The only 3 integers we need to store are the 4, 6, and 2 marked above; everything else is for illustration only.
A within-chunk rank can be stored with only 3 bits (because a chunk has size 8), and a chunk rank with only 4 bits (because the whole bit vector has only 16 bits).
Note that for a subchunk, we store the "within chunk" rank.
Using that data, to compute the rank of the 14th bit, we figure out it is in chunk #1 and in subchunk #1.1, so the rank up to bit 11 (the last bit before subchunk #1.1) is 6 + 2 = 8.
E is 1110; therefore, the 14th bit has rank 8 + 3 = 11. This last step is meant to be done by a lookup table like this:
Subchunk content, #which bit within subchunk, answer
...
E, 2, 3
...
That concludes how we will implement rank.
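
As a sketch, here is how a rank query could be answered from the three stored integers, using the example's 8-bit chunks and 4-bit subchunks (illustrative sizes; the final loop stands in for the subchunk lookup table):

```csharp
// Illustrative rank with 8-bit chunks and 4-bit subchunks (MSB-first bits).
// chunkRank[c]    = # of 1s before chunk c (chunkRank[0] = 0).
// subchunkRank[s] = # of 1s within s's chunk, before subchunk s.
static int Rank(byte[] bits, int[] chunkRank, int[] subchunkRank, int pos)
{
    int baseRank = chunkRank[pos / 8] + subchunkRank[pos / 4];
    // Count 1s in the subchunk up to and including pos (in practice,
    // a lookup table keyed by subchunk content and bit position).
    int withinSubchunk = 0;
    for (int i = (pos / 4) * 4; i <= pos; i++)
        if ((bits[i / 8] & (1 << (7 - (i % 8)))) != 0)
            withinSubchunk++;
    return baseRank + withinSubchunk;
}
```

For FACE, bits = { 0xFA, 0xCE }, chunkRank = { 0, 6 }, subchunkRank = { 0, 4, 0, 2 }, and Rank(..., 14) returns 6 + 2 + 3 = 11 as above.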

Select

To support select, we use a similar idea: divide the bit vector into groups with equal numbers of ones. Depending on the input, these groups will have the same number of ones but different numbers of zeros, so they have different lengths.
For each group except the first one, we store the start of the group.
Depending on the group size, we can do one of the following:

  • If the group is tiny, we can solve the problem with a lookup table.
  • If the group is big, we can store the indexes of the individual ones in the group.
  • If the group is neither small enough nor big enough, we can subdivide it into subgroups.

Again, let's go with an example. The bit vector again is FACE, and let's say the group size is 5 ones:

|     |          |
1111 1010 1100 1110
0123 4567 89AB CDEF

The bars denote where the groups start. Now we know:
group 0 starts at bit 0
group 1 starts at bit 5
group 2 starts at bit 14
If we wanted to know the position of the 9th one, we know it has to be in the second group (group 1). Assume this is a big group, so we stored the indexes of the individual ones: the 4th one within group 1 has a within-group position of 7. Together with the fact that group 1 starts at bit 5, we know the 9th one is at position 5 + 7 = 12.
We will skip the other variants, assuming the remaining cases are not too hard to work out.
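
Here is a sketch of just the "big group" variant described above (the 1-based convention and names are assumptions for illustration):

```csharp
// Illustrative select for "big" groups: groupStart[g] is the bit position
// where group g begins (0 for group 0), and onePositions[g] lists the
// within-group positions of that group's 1s.
static int Select(int[] groupStart, int[][] onePositions, int onesPerGroup, int x)
{
    int group = (x - 1) / onesPerGroup;       // which group holds the x-th 1
    int withinGroup = (x - 1) % onesPerGroup; // which 1 inside that group
    return groupStart[group] + onePositions[group][withinGroup];
}
```

With the FACE example (onesPerGroup = 5, groupStart = { 0, 5, 14 }, onePositions[1] = { 1, 3, 4, 7, 8 }), Select(groupStart, onePositions, 5, 9) returns 5 + 7 = 12, as above.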

Why do we do this?

These indexes give us constant lookup time. Every time we perform an operation, we simply determine which chunk/subchunk or group/subgroup we are in and perform just a few table lookups to get the solution.
The only thing to worry about with this algorithm is the size of the tables. The tedious math involved tells us that if we choose the chunk sizes as required, the overall space for these auxiliary bits is o(n). (Small-o, not Big-O.)

@cshung
Member Author

cshung commented Jun 29, 2022

@davidwrighton
Member

I've been thinking about this for a bit and I believe there are several things to consider.

  1. What percentage of the bits will be 1 in a typical situation? I think this is all reasonable, but it's clearly a fairly decent encoding assuming the number of hot/cold pairs is something like 5% of the file.

  2. I think the idea of working with a bit vector is fairly decent, but we need to concern ourselves with the load time behavior of this data structure, as well as the lookup time behavior. Load time performance for R2R images is extremely important, even more important than the performance of the generated code. Thus, I think we should make sure that whatever structure we use is encoded entirely in the PE file and does not require computation at runtime to setup.

  3. One important detail to remember when working with this sort of code is that while algorithms are often analyzed in terms of asymptotic notation, researchers often work with much larger data sets than are contained in a PE file. This can change the analysis of algorithms substantially. For instance, we know that this bit vector must represent RUNTIME_FUNCTION entries in a PE file. We also know that a fairly practical upper limit on the number of entries that we will need to track is on the order of 1 million.

In looking at the algorithm, I see that generating the actual bit vector is trivial, and the logical rank/select operations look reasonable for use. The question becomes, how to efficiently implement the rank and select algorithms and whether or not they actually need to be constant, or if some sort of tree structure is acceptable instead.

Rank data structure

Looking at this, a chunking approach, where we chunk on 32-byte or 64-byte boundaries, makes sense. We can easily create the chunks, and encoding them in the file is again trivial (it's just another array parallel to the bit vector). However, the algorithm described above has this concept of subchunks, and I do not see significant value in carrying an extra structure for subchunks. Instead, I would take advantage of the population count instructions of the hardware, which will allow (on a 64-bit platform) computing the subchunk rank in something like 24 cycles. Not all hardware we run on actually has support for these instructions, so we would need a fallback routine, but their presence is common enough that we can assume the instructions exist.
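
For reference, a fallback routine could be the classic bit-twiddling population count (purely illustrative; in .NET, System.Numerics.BitOperations.PopCount already falls back to an equivalent software routine when the instruction is unavailable):

```csharp
// Portable popcount fallback: the standard SWAR bit-counting algorithm.
static int PopCountFallback(ulong x)
{
    x = x - ((x >> 1) & 0x5555555555555555UL);                          // 2-bit sums
    x = (x & 0x3333333333333333UL) + ((x >> 2) & 0x3333333333333333UL); // 4-bit sums
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FUL;                          // 8-bit sums
    return (int)((x * 0x0101010101010101UL) >> 56);                     // sum the bytes
}
```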

Select data structure

This looks much more complicated and expensive to build, search, and maintain, and the benefit over doing a binary search through the chunk rank array seems minimal. I would suggest considering an algorithm that does the select operation by binary searching the rank chunk array to find the chunk where the select will succeed, and then implementing a bit-scanning algorithm to find the appropriate index.

@amanasifkhalid
Member

Thank you for the detailed feedback, @davidwrighton! I'll do my best to address your points:

What percentage of the bits will be 1 in a typical situation?

I don't have a comprehensive answer for this yet, but I do have some preliminary metrics here that reveal how often the JIT splits code in various SuperPMI collections; note that this is with splitting of EH funclets turned on. On the low end, the JIT split ~14% of functions in the benchmarks collection, which consists largely of "C-like" functions that don't leverage C# constructs and EH, thus making them less conducive to the JIT's splitting heuristics. On the high end, over 26% of functions in the asp.net collection were split; this collection has some PGO data that was leveraged by the JIT.

Thus, I think we should make sure that whatever structure we use is encoded entirely in the PE file and does not require computation at runtime to setup.

You touch on this later, but I agree emitting the rank array to file should be trivial. I initially prioritized minimizing the amount of new data written to the R2R format, but if startup time is our top priority, then it makes sense to write the rank structure to file.

Rank data structure

I like your idea of forgoing subchunks in favor of using the PopCount instruction. Just to clarify, do you mean using 32/64-bit chunk sizes, or 32/64-byte chunks?

Select data structure

I also agree the select data structure and algorithm may not be worth its complexity for our purposes. Binary search over the rank structure would be intuitive from a development standpoint, and would save us from creating and loading another data structure. I guess this raises the question of whether there are other data structures capable of binary search that are also worth our consideration, though I do not know of any that are as succinct.

With your suggestions in mind, I imagine a workflow like this (sketched in code after the list):

  • Emit the bit vector to file.
  • Calculate the rank of each 32/64-bit chunk, and emit an array of ranks to file.
  • At runtime, calculate rank(i) as follows:
    • Look up rank_array[x], where x is the index of the chunk preceding position i.
    • Let bitmask be the chunk containing position i, where all bits after i have been set to zero.
    • Return rank_array[x] + PopCount(bitmask).
  • Calculate select(i) as follows:
    • Search for the chunk index x for which rank_array[x] < i, and rank_array[x + 1] >= i.
    • Within this chunk, do a bit-scan for the appropriate index.
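
A sketch of this workflow with 64-bit chunks and BitOperations (the chunk size, LSB-first packing, and all names here are assumptions, not a final design):

```csharp
using System.Numerics;

// bitVector: the split-function bits packed LSB-first into ulongs
// (logical bit i lives at bit (i % 64) of bitVector[i / 64]).
// rankArray[c] = number of 1s in all chunks preceding chunk c.
static int Rank(ulong[] bitVector, int[] rankArray, int i)
{
    int chunk = i / 64;
    // Keep bits 0..(i % 64) of the chunk, then popcount them.
    ulong masked = bitVector[chunk] & (ulong.MaxValue >> (63 - (i % 64)));
    return rankArray[chunk] + BitOperations.PopCount(masked);
}

// Position of the i-th 1 (1-based); assumes 1 <= i <= total # of 1s.
static int Select(ulong[] bitVector, int[] rankArray, int i)
{
    // Binary search for the last chunk whose preceding rank is below i.
    int lo = 0, hi = rankArray.Length - 1;
    while (lo < hi)
    {
        int mid = (lo + hi + 1) / 2;
        if (rankArray[mid] < i) lo = mid; else hi = mid - 1;
    }
    // Bit-scan within the chunk for the (i - rankArray[lo])-th set bit.
    ulong chunk = bitVector[lo];
    for (int remaining = i - rankArray[lo]; remaining > 1; remaining--)
        chunk &= chunk - 1; // clear the lowest set bit
    return lo * 64 + BitOperations.TrailingZeroCount(chunk);
}
```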

Thanks again for the feedback! I'm curious to hear what @cshung and @EugenioPena think.

@davidwrighton
Member

@amanasifkhalid I would suggest 32- or 64-byte chunks. With the popcount instruction on X86/the equivalent NEON instructions on Arm, it becomes extremely efficient to count bits over regions of that size.

As far as a more efficient layout for binary searching goes, we could use an Eytzinger layout for the rank array data, but then it becomes more complex to implement the rank algorithm, as we want to use the sorted index for the lookup there, and translating from a sorted index into an Eytzinger layout index is non-trivial. My expectation is that we don't run these operations often enough for it to be worth the complexity. At a chunk size of 64 bytes, if there are 1,000,000 entries then we have about 2000 chunks to search through, and the log base 2 of that is about 11, which is probably reasonably cheap.

Alternatively, we could add a PGM index on top of the sorted array. That would be interesting, but again, I doubt that this data structure is hot enough for the complexity of a PGM index to really be worth it. See https://pgm.di.unipi.it/. I suspect that if we were actually to find value in one of these, it would be to address the cost of the lookup of a RUNTIME_FUNCTION in the first place.
