Optimize the outer lexer loop. #3140
Conversation
This adds a benchmark that synthesizes mixtures of symbols, keywords, and identifiers. It tweaks the distribution of identifier lengths based on some empirical measurements of, for example, LLVM's codebase. It also establishes a framework for skewing the symbol distribution, although that skew is currently based entirely on intuition rather than measurements, and should be adjusted as measurements arrive. The ratios between symbols, keywords, and identifiers are also unmeasured, but several different ratios are covered. Neither literals nor grouping symbols are included yet, as both present some additional challenges in forming them, and this seemed like a plausible increment in expanding the benchmark coverage.
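For a concrete picture of this kind of generation, here is a minimal sketch. The token sets, length weights, ratios, and helper names below are illustrative assumptions, not the benchmark's actual code:

```cpp
// Hypothetical sketch of weighted synthetic-source generation; the token
// sets, length distribution, and ratios are illustrative only.
#include <random>
#include <string>
#include <vector>

auto GenerateSource(int num_tokens, double symbol_ratio, double keyword_ratio,
                    std::mt19937& rng) -> std::string {
  const std::vector<std::string> symbols = {",", ";", "->", "==", "+"};
  const std::vector<std::string> keywords = {"fn", "var", "if", "else",
                                             "return"};

  // Skew identifier lengths toward the short lengths that dominate real
  // codebases (the description cites measurements of LLVM's sources); the
  // weights here are made up. Index i is the weight of length i.
  std::discrete_distribution<int> id_length(
      {0.0, 4.0, 10.0, 18.0, 20.0, 16.0, 12.0, 9.0, 6.0, 3.0, 2.0});
  std::uniform_real_distribution<double> unit(0.0, 1.0);
  std::uniform_int_distribution<int> letter('a', 'z');

  std::string source;
  for (int i = 0; i < num_tokens; ++i) {
    double roll = unit(rng);
    if (roll < symbol_ratio) {
      // A skewed symbol distribution would pick from `symbols` with
      // weights; round-robin keeps the sketch short.
      source += symbols[i % symbols.size()];
    } else if (roll < symbol_ratio + keyword_ratio) {
      source += keywords[i % keywords.size()];
    } else {
      int length = id_length(rng);
      for (int j = 0; j < length; ++j) {
        source += static_cast<char>(letter(rng));
      }
    }
    source += ' ';
  }
  return source;
}
```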
Previously, the code would try each form of lexing and let that sub-lexing routine reject the code. This was very branch heavy and also hard to optimize -- lots of hard-to-inline function calls, etc. However, it's really nice to keep the different categories of lexing broken out into their own functions rather than flattening this into a huge state machine.

So this creates a miniature explicit state machine by building a table of function pointers that wrap the methods on the lexer. The main lexing loop simply indexes this table with the first byte of the source code and calls the dispatch function pointer returned. The result is that the main lex loop is *incredibly* tight code, and the only time spent appears to be stalls waiting on memory for the next byte. =]

As part of this, optimize symbol lexing specifically by recognizing all the symbols that are exactly one character -- i.e., we don't even need to look at the *next* character; there is no max-munch or anything else. For these, we pre-detect the exact token kind and hand it into the symbol lexing routine to avoid re-computing it. The symbols in this category, like `,` and `;`, are really frequent in practice, so this seems likely worthwhile.

The one-byte dispatch should also be reasonably extendable in the future. For example, I suspect this is the likely hot path for non-ASCII lexing, where we see the UTF-8 marker at the start of a token and most (if not all) of the token is non-ASCII. We can use this table to dispatch immediately to an optimized routine dedicated to UTF-8 processing, without any slowdown for other inputs.

The benchmark results are best for keyword lexing because that is the fastest thing to lex -- it goes from 25 mt/s to 30 mt/s. Other improvements are less dramatic, but I think this is still worthwhile because it gives a really strong basis for both expanded functionality without a performance hit and further optimizations.
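For illustration, here is a minimal sketch of the dispatch-table shape this describes. The class layout and method names are hypothetical, not the real lexer's interfaces:

```cpp
// Hypothetical sketch of a first-byte dispatch table; names are
// illustrative, not the real lexer's interfaces.
#include <array>
#include <cstdint>

class Lexer {
 public:
  auto Lex(const char* text, const char* end) -> void;

  // Each category of lexing stays in its own readable method.
  auto LexIdentifier(const char* text, const char* end) -> const char* {
    while (text != end && *text >= 'a' && *text <= 'z') {
      ++text;  // A real lexer would also form the identifier token here.
    }
    return text;
  }
  auto LexSymbol(const char* text, const char* /*end*/) -> const char* {
    return text + 1;  // One-character symbol: the kind is already known.
  }
  auto LexError(const char* text, const char* /*end*/) -> const char* {
    return text + 1;  // Skip bytes this sketch doesn't understand.
  }
};

using DispatchFn = auto (*)(Lexer&, const char*, const char*) -> const char*;

// Build the 256-entry table indexed by the first byte of the next token.
// Captureless lambdas decay to plain function pointers, so each entry is a
// thin wrapper forwarding to the relevant member function.
constexpr auto MakeDispatchTable() -> std::array<DispatchFn, 256> {
  std::array<DispatchFn, 256> table = {};
  for (auto& entry : table) {
    entry = [](Lexer& lexer, const char* text, const char* end) {
      return lexer.LexError(text, end);
    };
  }
  for (int c = 'a'; c <= 'z'; ++c) {
    table[c] = [](Lexer& lexer, const char* text, const char* end) {
      return lexer.LexIdentifier(text, end);
    };
  }
  table[','] = table[';'] = [](Lexer& lexer, const char* text,
                               const char* end) {
    return lexer.LexSymbol(text, end);
  };
  return table;
}

constexpr std::array<DispatchFn, 256> kDispatchTable = MakeDispatchTable();

// The main loop: one indexed load and one indirect call per token, with no
// chain of "try each kind of lexing and reject" branches.
auto Lexer::Lex(const char* text, const char* end) -> void {
  while (text != end) {
    text = kDispatchTable[static_cast<uint8_t>(*text)](*this, text, end);
  }
}
```

Because the lambdas capture nothing, the whole table can be built at compile time, so no dynamic initialization sits on the lexing path.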
Note that with this change, there is some sneaky performance hiding in these dispatch lambdas. For ... various reasons ... LLVM chooses not to inline the functions into them, despite there being no other callers of these functions. The result is that each of these lambdas turns into an actual function that tail calls into the relevant member function. This is still an improvement, but it isn't free or optimal. I'm curious how folks would prefer I address this... I see several options...
#2 and #3 would both also involve passing the token, even when it is just a placeholder, and would likely recover some performance without the code duplication that seems hard to avoid in #1. Also interested in other ideas here. But to be clear, I'm mostly thinking of these as fix-forwards. This PR is still a strict improvement; I just finally dug through the profile enough to see some of the costs embedded here.
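The numbered options themselves aren't quoted above, but as a purely hypothetical sketch of what "passing the token, even when it is just a placeholder" might look like in this dispatch scheme (every name below is an assumption):

```cpp
// Purely hypothetical sketch of threading a token kind through dispatch;
// the names here are assumptions, not the PR's actual options.
#include <cstdint>

enum class TokenKind : uint8_t { Comma, Semi, Placeholder /* ... */ };

class Lexer {
 public:
  // One-character symbols arrive with their exact pre-detected kind;
  // everything else receives TokenKind::Placeholder and computes the kind
  // itself from the text.
  auto LexSymbol(TokenKind kind, const char* text, const char* end)
      -> const char*;
  auto LexIdentifier(TokenKind placeholder, const char* text, const char* end)
      -> const char*;
};

using DispatchFn = auto (*)(Lexer&, TokenKind, const char*, const char*)
    -> const char*;

struct DispatchEntry {
  DispatchFn fn;
  // Pre-detected kind for one-character symbols, Placeholder otherwise.
  TokenKind kind = TokenKind::Placeholder;
};

// With the kind stored alongside the function pointer, every wrapper lambda
// has an identical shape, giving the optimizer one body to merge or inline:
//   text = entry.fn(lexer, entry.kind, text, end);
```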
Co-authored-by: Richard Smith <richard@metafoo.co.uk>
Thanks, PTAL!