Separate hot counters #6
base: v2.1
Conversation
Function hot counters have been moved from the shared hashtable to inside the function prototype object, which is only shared between the closures of a function. The hot counter field was added to the end of the GCproto struct so that it sits right next to the function header bytecode in memory, which is the bytecode that decrements the counter. Since the function hot counters are now spread out across the function prototype objects, they will increase the number of dirty cache lines that eventually need to be flushed back to memory, versus only two cache lines for the shared hotcount hashtable, but this stops happening once the function is JIT'ed. Updated the interpreter assembly for the Lua function header bytecodes to use the new hot counter location. Added a separate JIT param for the function hot trigger count, currently set to the same effective value as with the shared hotcount table: 112 calls.
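A minimal sketch of the layout this describes, with illustrative names only (the actual field name, accessor, and decrement step in the patch may differ):

```c
#include <stdint.h>

/* Illustrative sketch: the function hot counter lives at the end of the
** prototype object, right before the function-header bytecode, instead of
** in a shared hash table.  ProtoSketch and proto_call_is_hot are made up
** for this example and are not the patch's actual identifiers. */
typedef struct ProtoSketch {
  uint32_t sizebc;     /* stand-in for the existing GCproto fields */
  uint16_t hotcount;   /* per-prototype function hot counter (the new field) */
} ProtoSketch;

/* What the function-header bytecode conceptually does on every call: */
static int proto_call_is_hot(ProtoSketch *pt, uint16_t step)
{
  if (pt->hotcount <= step) return 1;  /* hot: hand the call to the JIT */
  pt->hotcount -= step;
  return 0;                            /* not hot yet: keep interpreting */
}
```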
…unters The loop hot count will be stored in the 16 bit D part of the opcode, and the opcode is emitted directly after either BC_LOOP, BC_FORL or BC_ITERL depending on what kind of loop it's being generated for. These three opcodes do the hot counting for loops, so the hot count should be right next to them in memory. Update embedded bytecode in buildvm_libbc.h for bytecode changing from BC_LOOPHC
…TERL\BC_JITERL Skipping for BC_FORI NYI
…LOOPHC to store the count
…ake them JIT params Make the loop hot counters decrement by one instead of two for consistency, now that the counter backoff scaling values are not shared with functions. Use some non-default max trace attempt counts for loops and functions in the unit tests to test out the penaltymaxfunc and penaltymaxloop JIT parameters.
…other arch builds Add BC_LOOPHC under a define in lj_bc.h and lj_record.c. Don't embed the hot counter bytecode if not built with separate hot counters. Scale penalty values by 2 when separate counters are disabled.
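To illustrate the BC_LOOPHC scheme from the commits above, here is a rough sketch assuming LuaJIT's 32-bit instruction encoding (opcode in the low byte, operand A next, 16-bit D operand in the high half); the helper names are hypothetical:

```c
#include <stdint.h>

typedef uint32_t BCIns;                /* one LuaJIT bytecode instruction */

#define BC_D(i)        ((uint16_t)((i) >> 16))
#define SET_BC_D(p, d) (*(p) = (*(p) & 0xffffu) | ((uint32_t)(d) << 16))

/* Conceptual hot-count check done by BC_LOOP/BC_FORL/BC_ITERL at pc: the
** dummy BC_LOOPHC instruction emitted right after it carries the loop's
** hot count in its 16-bit D operand. */
static int loop_is_hot(BCIns *pc, uint16_t step)
{
  BCIns *hc = pc + 1;                  /* BC_LOOPHC follows the loop opcode */
  uint16_t n = BC_D(*hc);
  if (n <= step) return 1;             /* hot: start recording a trace */
  SET_BC_D(hc, (uint16_t)(n - step));  /* otherwise just count down */
  return 0;
}
```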
nice! At first glance it seems to be controlled by a compile flag, right?
Well, it's on by default for x86/x64 but can be disabled with LUAJIT_NO_SEPARATE_COUNTERS. The hot counter tests should still pass with it disabled.
src/lj_record.c
Outdated
lj_trace_flush(J, lnk); /* Flush trace that only returns. */
/* Set a small, pseudo-random hotcount for a quick retry of JFUNC*. */
hotcount_set(J2GG(J), J->pc+1, LJ_PRNG_BITS(J, 4));
hotcount_set_pt(J2GG(J), pt, J->pc, LJ_PRNG_BITS(J, 4));
indentation is weird here.
Well, it changed from a tab to spaces but the indent is the same if you set the tab size to 8 like LuaJIT uses. If there were an .editorconfig in the repo, GitHub would display this better.
@@ -5,3 +5,4 @@ gc64_slot_revival.lua
phi
snap.lua
stitch.lua
hotcounters.lua
missing newline at EOL
fixed
This seems like an excellent change to me. Being able to exclude hot-counter table collisions as the root cause during "WTF?" performance troubleshooting would be a significant productivity win. I don't think that the hashtable has ever directly caused a performance problem that I have investigated but even so it has been a productivity drag to have to consider the possibility and try to rule it out. Great work!
We only saw the big performance swings in a few of our benchmarks. If LuaJIT has trouble tracing the code, that seems to make the effect more pronounced, though. The random collisions do sometimes cause different traces that make things faster, but it's not really something you can control.
I would be enthusiastic about this change even if it doesn't impact any benchmarks. This should make users more productive because it means one less possibility to consider/exclude when looking for the root cause of surprising performance problems.
A corollary is that, since this feature is only implemented and enabled for x86 platforms, this might mean one more possibility to consider when looking for the root cause of different JIT behaviour on different platforms.
As an outsider, is anything going to be done with this pull request (@lukego is this in RaptorJIT?), or is it just going to languish here?
@ZirconiumX RaptorJIT will adopt this change. The easiest way to do that is probably to fold it into the ongoing port of the VM from assembler to C though (raptorjit/raptorjit#199). Could also merge it more quickly, but my testing at raptorjit/raptorjit#56 makes me think this change is not super urgent (but definitely important).
Mea culpa. I hope to carve some time this week to apply this and also some fixes from upstream.
Huh... I seem to recall Mike Pall had an explanation why he didn't originally do it this way (in short: cache locality; the hash table is designed to fit into a single cache line). Has anyone measured the effect of this change?
This is a good question: it makes sense to know the cost of a change. But I also think you need to decide up front what you are optimizing for. Is it more important to make the JIT more predictable or to make the interpreter faster? The RaptorJIT perspective is that making compilation predictable to application developers is the number one priority. If you understand the JIT then your code will always compile and you won't use the interpreter anyway. This is why we are enthusiastic about purging all probabilistic algorithms from the JIT heuristics. We see it as a major time suck to be troubleshooting performance problems while having to consider that maybe there is no root cause in the source code and the problem is just dumb bad luck. So we'd want this change even if it did add a cache miss here and there during warmup, because that's not our pain point.
@lukego wrote:
Well... I guess. I initially wanted to write that with JIT compilation, the behaviour is in any case predictable only for the same input (data, I/O timings etc.), and the hash table doesn't in itself include any genuine randomness. But I guess the argument still works if you replace "predictable" by "locally predictable", as in: the JIT behaviour of a loop should not depend on literally everything else in the program. (Might still make sense to bunch the counters together somehow, like in a global or per-file table, but I have no actual idea.) However, my original comment was less implicit disapproval and more genuine curiosity, as the original reasoning ("because cache locality") is one of those things that sound eminently sensible, but could always turn out to be utterly negligible in practice. So I think it'd actually be very interesting to know how it turns out (and I suck at benchmarking). P.S. I don't know how, but for some reason I couldn't originally find the description of the hash table despite it being literally two browser tabs over. Here it is:
I guess cache locality is a factor against a bigger hash table, but couldn't this have even better cache use? If I recall correctly, this puts the counters in the bytecode, which is definitely in cache when it's being interpreted, and doesn't try to put entries relevant to other, far-distant parts of the code all together. In the end, benchmarks trump any amount of guessing.
Yeah, I put the counter value for functions right next to the function header entry bytecode in memory, and for loops I put the counter value after the loop hot-counting instructions FORL, ITERL and LOOP.
Just indulging in a bit of napkin math... Suppose your program has 1000 LOOP/FUNC bytecodes with hotcounts, and each one is executed on average 100 times before being JITed or blacklisted, and each hotcount update costs you 10 cycles. Then the total cost during warmup would be one million CPU cycles. On a 2GHz CPU that would be 500 microseconds of delay over the lifetime of the program.
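Written out, that estimate is:

$$
1000 \text{ sites} \times 100 \text{ executions} \times 10 \text{ cycles} = 10^6 \text{ cycles}, \qquad
\frac{10^6 \text{ cycles}}{2 \times 10^9 \text{ cycles/s}} = 5 \times 10^{-4}\,\text{s} = 500\,\mu\text{s}.
$$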
This PR tries to make performance more consistent by switching the hot counters used by functions and loops from being shared in a small hash table to having a separate counter for each loop or function. This makes which functions/loops are considered hot enough to start a trace consistent with how often they are called, instead of being somewhat random based on collisions in the hash table that change each time the process is run. For small benchmarks this effect is rarely seen, but in larger Lua code bases we've seen performance oscillate from one process execution to the next by up to 20%, which stopped once we made the hot counters separate.
Normally the 16 bit hot counters for functions and loops live in a small 64-slot hashtable, with the key being the address of the bytecode doing the hot counting, which is either the first bytecode of a function or one of the loop bytecodes BC_LOOP, BC_FORL, BC_ITERL. This PR moves the hot count for functions to a field at the end of the function prototype (GCproto) struct, and for loops to a new dummy bytecode (BC_LOOPHC) that is emitted after the loop bytecodes listed above to store the loop's hot count.
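For contrast with the sketches in the commit descriptions above, here is what the old shared-table lookup looks like conceptually. The 64-slot size and 16-bit counters come from the description above; the indexing expression is illustrative, not copied from the source:

```c
#include <stdint.h>

/* Old scheme (sketch): a single 64-slot table of 16-bit counters shared by
** every function and loop, indexed by a hash of the hot-counting bytecode's
** address.  Two unrelated hot sites can land in the same slot, so which one
** becomes "hot" first depends on where the bytecode happens to sit in
** memory, which can change from one process run to the next. */
#define HOTCOUNT_SLOTS 64
static uint16_t shared_hotcount[HOTCOUNT_SLOTS];

static uint16_t *shared_counter_for(const void *bytecode_pc)
{
  uintptr_t pc = (uintptr_t)bytecode_pc;
  return &shared_hotcount[(pc >> 2) & (HOTCOUNT_SLOTS - 1)];  /* collisions possible */
}
```

The PR replaces this with the per-prototype and per-loop counters sketched earlier, so the trigger point for a given function or loop depends only on how often it actually runs.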