Optimize the outer lexer loop. #3140

Merged: 7 commits into carbon-language:trunk, Aug 23, 2023

Conversation

chandlerc (Contributor)

Previously, the code would try each form of lexing and let that
sub-lexing routine reject the code. This was very branch-heavy and also
hard to optimize -- lots of hard-to-inline function calls, etc.

However, it's really nice to keep the different categories of lexing
broken out into their own functions rather than flattening this into
a huge state machine.

So this creates a miniature explicit state machine by building a table
of function pointers that wrap the methods on the lexer. The main lexing
loop simply indexes this table with the first byte of the source code
and calls the function pointer it finds there.
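
For readers unfamiliar with the pattern, here is a minimal, self-contained sketch of the shape of this byte-indexed dispatch. The names and handler bodies are illustrative stand-ins, not the PR's actual code:

```cpp
#include <array>
#include <cctype>
#include <cstdio>
#include <string_view>

struct Lexer {
  void Lex(std::string_view source);
};

// Each table entry handles one category of token, keyed by the first byte.
using DispatchFnT = void (*)(Lexer&, std::string_view&);

// Toy handlers standing in for the lexer's category-specific methods.
static void LexIdentifier(Lexer&, std::string_view& s) {
  size_t n = 0;
  while (n < s.size() &&
         (std::isalnum(static_cast<unsigned char>(s[n])) || s[n] == '_')) {
    ++n;
  }
  std::printf("identifier: %.*s\n", static_cast<int>(n), s.data());
  s.remove_prefix(n);
}
static void LexSymbol(Lexer&, std::string_view& s) {
  std::printf("symbol: %c\n", s.front());
  s.remove_prefix(1);
}
static void LexSkip(Lexer&, std::string_view& s) { s.remove_prefix(1); }

// Build the 256-entry dispatch table once, at compile time.
static constexpr std::array<DispatchFnT, 256> MakeTable() {
  std::array<DispatchFnT, 256> t{};
  for (auto& fn : t) fn = &LexSkip;
  for (int c = 'a'; c <= 'z'; ++c) t[c] = &LexIdentifier;
  for (int c = 'A'; c <= 'Z'; ++c) t[c] = &LexIdentifier;
  t['_'] = &LexIdentifier;
  for (char c : std::string_view(",;=+-*")) {
    t[static_cast<unsigned char>(c)] = &LexSymbol;
  }
  return t;
}
static constexpr auto Table = MakeTable();

// The hot loop: one byte load and one indexed indirect call per token.
void Lexer::Lex(std::string_view source) {
  while (!source.empty()) {
    Table[static_cast<unsigned char>(source.front())](*this, source);
  }
}

int main() {
  Lexer lexer;
  lexer.Lex("var x = y + z;");
}
```

Because the table is built once, the per-token cost in `Lex` is just the byte load and the indirect call.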

The result is that the main lex loop is *incredibly* tight code, and the
only time spent appears to be stalls waiting on memory for the next
byte. =]

As part of this, optimize symbol lexing specifically by recognizing all
the symbols that are exactly one character -- i.e., we don't even need to
look at the *next* character; there is no max-munch or anything else.
For these, we pre-detect the exact token kind and hand that into the
symbol lexing routine to avoid re-computing it. The symbols in this
category, like `,` and `;`, are really frequent in practice, so this
seems likely to be worthwhile.
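
As a sketch of the idea (again with made-up names rather than the PR's code), captureless lambdas in the table can bake the pre-detected kind into each entry and hand it to one shared routine:

```cpp
#include <cstdio>
#include <string_view>

enum class TokenKind { Comma, Semi };

struct Lexer {
  void AddToken(TokenKind kind) {
    std::printf("token kind %d\n", static_cast<int>(kind));
  }
};

// Shared routine for all one-character symbols: the caller already knows
// the exact token kind, so there is no lookahead and no kind recomputation.
void LexOneCharSymbol(Lexer& lexer, std::string_view& source, TokenKind kind) {
  lexer.AddToken(kind);
  source.remove_prefix(1);
}

using DispatchFnT = void (*)(Lexer&, std::string_view&);

int main() {
  Lexer lexer;
  std::string_view source = ",;";
  // Captureless lambdas convert to plain function pointers, so each table
  // entry can carry its symbol's exact kind at no runtime cost.
  DispatchFnT comma = [](Lexer& l, std::string_view& s) {
    LexOneCharSymbol(l, s, TokenKind::Comma);
  };
  DispatchFnT semi = [](Lexer& l, std::string_view& s) {
    LexOneCharSymbol(l, s, TokenKind::Semi);
  };
  comma(lexer, source);  // in the real table, this lives at table[',']
  semi(lexer, source);   // ... and this at table[';']
}
```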

The one-byte dispatch should also be reasonably extensible in the
future. For example, I suspect this is the likely hot path for non-ASCII
lexing, where we see the UTF-8 marker at the start of a token and most
(if not all) of the token is non-ASCII. We can use this table to
dispatch immediately to an optimized routine dedicated to UTF-8
processing, without any slowdown for other inputs.
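
Concretely, and extending the dispatch-table sketch above, that could be as simple as routing every byte with the high bit set to one dedicated handler. This is a hypothetical extension, not code from the PR:

```cpp
// Every byte with the high bit set is a UTF-8 lead or continuation byte,
// so all of them can share one non-ASCII handler.
static void LexNonAscii(Lexer&, std::string_view& s) {
  // Consume one UTF-8 sequence: the lead byte plus its continuation bytes.
  size_t n = 1;
  while (n < s.size() &&
         (static_cast<unsigned char>(s[n]) & 0xC0) == 0x80) {
    ++n;
  }
  s.remove_prefix(n);
}

static constexpr std::array<DispatchFnT, 256> MakeTableWithUtf8() {
  auto t = MakeTable();  // the ASCII table from the sketch above
  for (int b = 0x80; b <= 0xFF; ++b) {
    t[b] = &LexNonAscii;
  }
  return t;
}
```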

The benchmark results are best for keyword lexing because that is the
fastest thing to lex -- it goes from 25 mt/s to 30 mt/s. Other
improvements are less dramatic, but I think this is still worthwhile
because it gives a really strong basis for both expanded functionality
without a performance hit and further optimizations.

This adds a benchmark that tries to synthesize mixtures of symbols,
keywords, and identifiers. It tweaks the distribution of identifier
lengths based on some empirical measurements of, for example, LLVM's
codebase.

Also establish a framework for skewing the symbol distribution, although
that one is based entirely on intuition rather than measurements. It
should be adjusted as measurements become available.

The ratios between symbols, keywords, and identifiers are also
unmeasured, but several different ratios are covered.
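
For illustration, here is a minimal sketch of how such skewed inputs could be synthesized with standard-library distributions; every weight, length, and symbol set below is a placeholder, not one of the benchmark's measured values:

```cpp
#include <random>
#include <string>
#include <vector>

int main() {
  std::mt19937 rng(42);

  // Identifier lengths, skewed toward short names (placeholder weights).
  std::vector<int> lengths = {1, 2, 3, 4, 6, 8, 12, 16};
  std::discrete_distribution<int> length_dist(
      {5.0, 10.0, 18.0, 20.0, 20.0, 15.0, 8.0, 4.0});

  // Symbols, skewed toward `,` and `;` (placeholder weights).
  std::vector<std::string> symbols = {",", ";", "=", "->", "=="};
  std::discrete_distribution<int> symbol_dist({30.0, 25.0, 20.0, 15.0, 10.0});

  // Token categories: symbol / keyword / identifier ratio (placeholder).
  std::discrete_distribution<int> category_dist({1.0, 1.0, 1.0});

  std::uniform_int_distribution<int> letter('a', 'z');
  std::string source;
  for (int i = 0; i != 1000; ++i) {
    switch (category_dist(rng)) {
      case 0: source += symbols[symbol_dist(rng)]; break;
      case 1: source += "fn"; break;  // stand-in keyword
      default: {
        int len = lengths[length_dist(rng)];
        for (int j = 0; j != len; ++j) {
          source += static_cast<char>(letter(rng));
        }
        break;
      }
    }
    source += ' ';
  }
  // `source` would then be fed to the lexer under benchmark.
}
```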

Neither literals nor grouping symbols are included yet, as both present
some additional challenges in forming them, and this seemed like
a plausible increment in expanding the benchmark coverage.
@chandlerc (Contributor, Author) commented:

Note that with this change, there is some sneaky performance cost hiding in these dispatch lambdas.

For ... various reasons ... LLVM chooses not to inline the functions into them, despite there being no other callers of these functions.

The result is that each of these lambdas turns into an actual function that tail calls into the relevant member function. This is still an improvement, but it isn't free or optimal.

I'm curious how folks would prefer I address this... I see several options...

  1. Force inlining -- simple, but it would have a code-size impact for the symbols where we generate more meaningful thunks. We could try leaving just those un-inlined.

  2. Restructure to homogeneous member functions and use a table of pointers-to-member.

  3. Restructure to homogeneous free functions.

#2 and #3 would both also involve passing the token, even when it is just a placeholder, and would likely recover some performance without the code duplication that seems hard to avoid in #1.
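
To make option #3 concrete, here is a hypothetical sketch (not code from this PR) of a homogeneous free-function table, where every handler shares one signature and receives the pre-detected token kind even when it is just a placeholder:

```cpp
#include <array>
#include <cstdio>
#include <string_view>

enum class TokenKind { Error, Comma, Semi, Identifier };

struct Lexer { /* lexer state */ };

// Homogeneous signature: every handler is a free function taking the lexer,
// the remaining source, and a kind that only some handlers care about.
using DispatchFnT = void (*)(Lexer&, std::string_view&, TokenKind);

void LexOneCharSymbol(Lexer&, std::string_view& s, TokenKind kind) {
  std::printf("symbol kind %d\n", static_cast<int>(kind));
  s.remove_prefix(1);
}
void LexIdentifier(Lexer&, std::string_view& s, TokenKind /*placeholder*/) {
  s.remove_prefix(1);  // a real handler would scan and compute its own kind
}
void LexError(Lexer&, std::string_view& s, TokenKind) { s.remove_prefix(1); }

// Each table entry pairs a function pointer with a pre-detected kind, so no
// per-entry lambda thunk sits between the loop and the handler.
struct DispatchEntry {
  DispatchFnT fn = &LexError;
  TokenKind kind = TokenKind::Error;
};

int main() {
  std::array<DispatchEntry, 256> table;
  table[','] = {&LexOneCharSymbol, TokenKind::Comma};
  table[';'] = {&LexOneCharSymbol, TokenKind::Semi};
  for (int c = 'a'; c <= 'z'; ++c) {
    table[c] = {&LexIdentifier, TokenKind::Identifier};
  }

  Lexer lexer;
  std::string_view source = "x,;";
  while (!source.empty()) {
    DispatchEntry entry = table[static_cast<unsigned char>(source.front())];
    entry.fn(lexer, source, entry.kind);
  }
}
```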

Also interested in other ideas here. But to be clear, I'm mostly thinking of these as fix-forwards. This PR is still a strict improvement, I just finally dug through the profile enough to see some of the costs embedded here.

@chandlerc requested a review from jonmeow on Aug 23, 2023 17:53
(Several review threads on toolchain/lexer/tokenized_buffer.cpp, since resolved.)
@chandlerc (Contributor, Author) left a comment:

Thanks, PTAL!

(Further review threads on toolchain/lexer/tokenized_buffer.cpp, since resolved.)
@zygoloid added this pull request to the merge queue Aug 23, 2023
Merged via the queue into carbon-language:trunk with commit b2db2ca Aug 23, 2023
6 checks passed