Optimize the outer lexer loop. #3140
Conversation
This adds a benchmark that synthesizes mixtures of symbols, keywords, and identifiers. It tweaks the distribution of identifier lengths based on some empirical measurements of, for example, LLVM's codebase. It also establishes a framework for skewing the symbol distribution, although that skew is currently based entirely on intuition rather than measurements, and should be adjusted as measurements arrive. The ratios between symbols, keywords, and identifiers are also unmeasured, but several different ratios are covered. Neither literals nor grouping symbols are included yet, as both present some additional challenges in forming them, and this seemed like a plausible increment in expanding the benchmark coverage.
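For a concrete picture of this kind of generation, here is a minimal sketch. The token sets, length weights, ratios, and helper names below are illustrative assumptions, not the benchmark's actual code:

```cpp
// Hypothetical sketch of weighted synthetic-source generation; the token
// sets, length distribution, and ratios are illustrative only.
#include <random>
#include <string>
#include <vector>

auto GenerateSource(int num_tokens, double symbol_ratio, double keyword_ratio,
                    std::mt19937& rng) -> std::string {
  const std::vector<std::string> symbols = {",", ";", "->", "==", "+"};
  const std::vector<std::string> keywords = {"fn", "var", "if", "else",
                                             "return"};

  // Skew identifier lengths toward the short lengths that dominate real
  // codebases (the description cites measurements of LLVM's sources); the
  // weights here are made up. Index i is the weight of length i.
  std::discrete_distribution<int> id_length(
      {0.0, 4.0, 10.0, 18.0, 20.0, 16.0, 12.0, 9.0, 6.0, 3.0, 2.0});
  std::uniform_real_distribution<double> unit(0.0, 1.0);
  std::uniform_int_distribution<int> letter('a', 'z');

  std::string source;
  for (int i = 0; i < num_tokens; ++i) {
    double roll = unit(rng);
    if (roll < symbol_ratio) {
      // A skewed symbol distribution would pick from `symbols` with
      // weights; round-robin keeps the sketch short.
      source += symbols[i % symbols.size()];
    } else if (roll < symbol_ratio + keyword_ratio) {
      source += keywords[i % keywords.size()];
    } else {
      int length = id_length(rng);
      for (int j = 0; j < length; ++j) {
        source += static_cast<char>(letter(rng));
      }
    }
    source += ' ';
  }
  return source;
}
```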
Previously, the code would try each form of lexing and let that sub-lexing routine reject the code. This was very branch heavy and also hard to optimize -- lots of hard-to-inline function calls, etc. However, it's really nice to keep the different categories of lexing broken out into their own functions rather than flattening this into a huge state machine.

So this creates a miniature explicit state machine by building a table of function pointers that wrap the methods on the lexer. The main lexing loop simply indexes this table with the first byte of the source code and calls the dispatch function pointer returned. The result is that the main lex loop is *incredibly* tight code, and the only time spent appears to be stalls waiting on memory for the next byte. =]

As part of this, optimize symbol lexing specifically by recognizing all the symbols that are exactly one character -- i.e., we don't even need to look at the *next* character; there is no max-munch or anything else. For these, we pre-detect the exact token kind and hand it into the symbol lexing routine to avoid re-computing it. The symbols in this category, like `,` and `;`, are really frequent in practice, so this seems likely worthwhile.

The one-byte dispatch should also be reasonably extendable in the future. For example, I suspect this is the likely hot path for non-ASCII lexing, where we see the UTF-8 marker at the start of a token and most (if not all) of the token is non-ASCII. We can use this table to dispatch immediately to an optimized routine dedicated to UTF-8 processing, without any slowdown for other inputs.

The benchmark results are best for keyword lexing because that is the fastest thing to lex -- it goes from 25 mt/s to 30 mt/s. Other improvements are less dramatic, but I think this is still worthwhile because it gives a really strong basis for both expanded functionality without a performance hit and further optimizations.
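For illustration, here is a minimal sketch of the dispatch-table shape this describes. The class layout and method names are hypothetical, not the real lexer's interfaces:

```cpp
// Hypothetical sketch of a first-byte dispatch table; names are
// illustrative, not the real lexer's interfaces.
#include <array>
#include <cstdint>

class Lexer {
 public:
  auto Lex(const char* text, const char* end) -> void;

  // Each category of lexing stays in its own readable method.
  auto LexIdentifier(const char* text, const char* end) -> const char* {
    while (text != end && *text >= 'a' && *text <= 'z') {
      ++text;  // A real lexer would also form the identifier token here.
    }
    return text;
  }
  auto LexSymbol(const char* text, const char* /*end*/) -> const char* {
    return text + 1;  // One-character symbol: the kind is already known.
  }
  auto LexError(const char* text, const char* /*end*/) -> const char* {
    return text + 1;  // Skip bytes this sketch doesn't understand.
  }
};

using DispatchFn = auto (*)(Lexer&, const char*, const char*) -> const char*;

// Build the 256-entry table indexed by the first byte of the next token.
// Captureless lambdas decay to plain function pointers, so each entry is a
// thin wrapper forwarding to the relevant member function.
constexpr auto MakeDispatchTable() -> std::array<DispatchFn, 256> {
  std::array<DispatchFn, 256> table = {};
  for (auto& entry : table) {
    entry = [](Lexer& lexer, const char* text, const char* end) {
      return lexer.LexError(text, end);
    };
  }
  for (int c = 'a'; c <= 'z'; ++c) {
    table[c] = [](Lexer& lexer, const char* text, const char* end) {
      return lexer.LexIdentifier(text, end);
    };
  }
  table[','] = table[';'] = [](Lexer& lexer, const char* text,
                               const char* end) {
    return lexer.LexSymbol(text, end);
  };
  return table;
}

constexpr std::array<DispatchFn, 256> kDispatchTable = MakeDispatchTable();

// The main loop: one indexed load and one indirect call per token, with no
// chain of "try each kind of lexing and reject" branches.
auto Lexer::Lex(const char* text, const char* end) -> void {
  while (text != end) {
    text = kDispatchTable[static_cast<uint8_t>(*text)](*this, text, end);
  }
}
```

Because the lambdas capture nothing, the whole table can be built at compile time, so no dynamic initialization sits on the lexing path.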
Note that with this change, there is some sneaky performance hiding in these dispatch lambdas. For ... various reasons ... LLVM chooses not to inline the functions into them, despite there being no other callers of these functions. The result is that each of these lambdas turns into an actual function that tail calls into the relevant member function. This is still an improvement, but it isn't free or optimal. I'm curious how folks would prefer I address this... I see several options...
#2 and #3 would both also involve passing the token, even when it is just a placeholder, and would likely recover some performance without the code duplication that seems hard to avoid in #1. Also interested in other ideas here. But to be clear, I'm mostly thinking of these as fix-forwards. This PR is still a strict improvement; I just finally dug through the profile enough to see some of the costs embedded here.
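The numbered options themselves aren't quoted above, but as a purely hypothetical sketch of what "passing the token, even when it is just a placeholder" might look like in this dispatch scheme (every name below is an assumption):

```cpp
// Purely hypothetical sketch of threading a token kind through dispatch;
// the names here are assumptions, not the PR's actual options.
#include <cstdint>

enum class TokenKind : uint8_t { Comma, Semi, Placeholder /* ... */ };

class Lexer {
 public:
  // One-character symbols arrive with their exact pre-detected kind;
  // everything else receives TokenKind::Placeholder and computes the kind
  // itself from the text.
  auto LexSymbol(TokenKind kind, const char* text, const char* end)
      -> const char*;
  auto LexIdentifier(TokenKind placeholder, const char* text, const char* end)
      -> const char*;
};

using DispatchFn = auto (*)(Lexer&, TokenKind, const char*, const char*)
    -> const char*;

struct DispatchEntry {
  DispatchFn fn;
  // Pre-detected kind for one-character symbols, Placeholder otherwise.
  TokenKind kind = TokenKind::Placeholder;
};

// With the kind stored alongside the function pointer, every wrapper lambda
// has an identical shape, giving the optimizer one body to merge or inline:
//   text = entry.fn(lexer, entry.kind, text, end);
```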
Co-authored-by: Richard Smith <richard@metafoo.co.uk>
Thanks, PTAL!