Merge lexer from the toolchain repository. #213
Merged
Conversation
Main original commit message:

> Add stubs of a diagnostic emission library. (#13)
>
> This doesn't have much wired up yet, but tries to lay out the most
> primitive API pattern.
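The stub commit above doesn't spell out the API, so here is a speculative sketch of one "most primitive" emission pattern: a single virtual entry point behind which presentation is hidden. Every name in it (`SourceLocation`, `DiagnosticEmitter`, `EmitError`, `StreamDiagnosticEmitter`) is hypothetical and not taken from the actual library.

```cpp
#include <cstdio>
#include <string>

// Hypothetical source coordinate; the real library's location type is not
// shown in this PR.
struct SourceLocation {
  int line;
  int column;
};

// A minimal emitter interface: one virtual entry point that sinks a
// diagnostic, so producers stay decoupled from presentation.
class DiagnosticEmitter {
 public:
  virtual ~DiagnosticEmitter() = default;
  virtual void EmitError(SourceLocation loc, const std::string& message) = 0;
};

// One concrete sink that prints to stderr in a compiler-style format.
class StreamDiagnosticEmitter : public DiagnosticEmitter {
 public:
  void EmitError(SourceLocation loc, const std::string& message) override {
    std::fprintf(stderr, "%d:%d: error: %s\n", loc.line, loc.column,
                 message.c_str());
  }
};
```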
The only change here is to update the fuzzer build extension path. The main original commit message:

> Add an initial lexer. (#17)
>
> The specific logic here hasn't been updated to track the latest discussed
> changes, much less implement many aspects of things like Unicode support.
>
> However, this should lay out a reasonable framework and set of APIs. It
> gives an idea of the overall lexer architecture being proposed. The actual
> lexing algorithm is a relatively boring and naive hand-written loop. It may
> make sense to replace this with something generated or another more
> advanced approach in the future; getting the implementation right was not
> the primary goal here. Instead, the focus was entirely on the architecture,
> encapsulation, APIs, and the testing infrastructure.
>
> The architecture of the lexer differs from "classical" high-performance
> lexers in compilers. A high-level summary:
>
> - It is eager rather than lazy, lexing an entire file.
> - Tokens intrinsically know their source location.
> - Grouping lexical symbols are tracked within the lexer.
> - Indentation is tracked within the lexer.
>
> Tracking of grouping and indentation is intended to simplify the strategies
> used for recovery from mismatched grouping tokens, and eventually to make
> use of indentation.
>
> Folding the source location into the token itself simplifies the data
> structures significantly, and doesn't lose any fidelity due to the absence
> of a preprocessor with token pasting.
>
> The fact that this is an eager lexer instead of a lazy lexer is designed to
> simplify the implementation and testing of the lexer (and subsequent
> components). There is no reason to expect Carbon to lex so many tokens that
> there are significant locality advantages to lazy lexing. Moreover, if we
> want comparable performance benefits, I think pipelining is a much more
> promising architecture than laziness. For now, the simplicity is a huge
> win.
>
> Being eager also makes it easy for us to use extremely dense memory
> encodings for the information about lexed tokens. Everything is created in
> a dense array, and small indices are used to identify each token within the
> array.
>
> There is a fuzzer included here that we have run extensively over the code,
> but currently toolchain bugs and Bazel limitations prevent it from easily
> building. I'm hoping myself or someone else can push on this soon and
> enable the fuzzer to at least build, if not run fuzz tests automatically.
> We have a significant fuzzing corpus that I'll add in a subsequent commit
> as well.

This also includes the fuzzer, whose commit message was:

> Add fuzz testing infrastructure and the lexer's fuzzer. (#21)
>
> This adds a fairly simple `cc_fuzz_test` macro that is specialized for
> working with LLVM's LibFuzzer. In addition to building the fuzzer binary
> with the toolchain's `fuzzer` feature, it also sets up the test execution
> to pass the corpus as file arguments, which is a simple mechanism to enable
> regression testing against the fuzz corpus.
>
> I've included an initial fuzzer corpus as well. To run the fuzzer in an
> open-ended fashion and build up a larger corpus:
>
> ```shell
> mkdir /tmp/new_corpus
> cp lexer/fuzzer_corpus/* /tmp/new_corpus
> ./bazel-bin/lexer/tokenized_buffer_fuzzer /tmp/new_corpus
> ```
>
> You can parallelize the fuzzer by adding `-jobs=N` for N threads. For more
> details about running fuzzers, see the documentation:
> http://llvm.org/docs/LibFuzzer.html
>
> To minimize and merge any interesting new inputs:
>
> ```shell
> ./bazel-bin/lexer/tokenized_buffer_fuzzer -merge=1 \
>   lexer/fuzzer_corpus /tmp/new_corpus
> ```

Co-authored-by: Jon Meow <46229924+jonmeow@users.noreply.github.com>
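To make the "dense array plus small indices" design from the lexer commit message concrete, here is a minimal sketch under assumed names; `TokenizedBuffer`, `TokenInfo`, and `Token` below are illustrative stand-ins rather than the actual toolchain API.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical token kinds; the real lexer defines many more.
enum class TokenKind : uint8_t { Identifier, IntegerLiteral, OpenParen, CloseParen };

// Per-token record, kept small so the array stays dense and cache-friendly.
// Tokens intrinsically know their source location, per the design above.
struct TokenInfo {
  TokenKind kind;
  int32_t line;
  int32_t column;
};

// Opaque token handle: a small index into the buffer's token array rather
// than a pointer, so it stays 32 bits no matter how TokenInfo grows.
struct Token {
  int32_t index;
};

class TokenizedBuffer {
 public:
  // Eager lexing would populate the buffer for an entire file up front;
  // the lexing loop itself is elided here.
  auto AddToken(TokenInfo info) -> Token {
    tokens_.push_back(info);
    return Token{static_cast<int32_t>(tokens_.size() - 1)};
  }

  auto GetTokenInfo(Token token) const -> const TokenInfo& {
    return tokens_[token.index];
  }

 private:
  std::vector<TokenInfo> tokens_;  // Dense storage for every lexed token.
};
```

Because a `Token` is just a 32-bit index, handles are trivially copyable and comparable, and all per-token data stays contiguous in memory, which is what makes the encoding dense.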
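The fuzzer binary follows the standard LibFuzzer target shape documented at http://llvm.org/docs/LibFuzzer.html; a hedged sketch follows, where `LexSource` is a hypothetical stand-in for the real lexer entry point.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical stand-in for the real lexer entry point under test.
static void LexSource(const std::string& source) {
  // ... lex `source` and check internal invariants ...
  (void)source;
}

// LibFuzzer's required entry point: called repeatedly with mutated inputs
// and expected to return 0; any crash or sanitizer failure is a finding.
extern "C" int LLVMFuzzerTestOneInput(const uint8_t* data, size_t size) {
  LexSource(std::string(reinterpret_cast<const char*>(data), size));
  return 0;
}
```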
jonmeow approved these changes on Dec 7, 2020.
chandlerc added a commit that referenced this pull request on Jun 28, 2022.