Merge lexer from the toolchain repository. #213

Merged 3 commits into carbon-language:trunk on Dec 8, 2020
Conversation

chandlerc (Contributor)
The only change here is to update the fuzzer build extension path.

The main original commit message:

Add an initial lexer. (#17)

The specific logic here hasn't been updated to track the latest
discussed changes, much less implement many aspects of things like
Unicode support.

However, this should lay out a reasonable framework and set of APIs.
It gives an idea of the overall lexer architecture being proposed. The
actual lexing algorithm is a relatively boring and naive hand-written
loop. It may make sense to replace this with something generated or
another more advanced approach in the future; getting the
implementation right was not the primary goal here. Instead, the focus
was entirely on the architecture, encapsulation, APIs, and the testing
infrastructure.

The architecture of the lexer differs from "classical" high-performance
lexers in compilers. A high-level summary:

  • It is eager rather than lazy, lexing an entire file.
  • Tokens intrinsically know their source location.
  • Grouping lexical symbols are tracked within the lexer.
  • Indentation is tracked within the lexer.

Tracking of grouping and indentation is intended to simplify the
strategies used for recovering from mismatched grouping tokens, and
eventually to make use of indentation in that recovery.
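
As a rough illustration of this grouping-token tracking, here is a
minimal C++ sketch; every name in it is a hypothetical stand-in for
the purposes of this description, not the actual API:

```cpp
// Illustrative sketch only; names are hypothetical, not the actual API.
#include <cassert>
#include <cstdint>
#include <vector>

// A token is identified by a small index into the buffer's dense array.
using TokenIndex = int32_t;

struct TokenInfo {
  bool is_opening_symbol = false;  // For example `(`, `[`, `{`.
  bool is_closing_symbol = false;  // For example `)`, `]`, `}`.
  // For grouping symbols, the index of the matching partner token,
  // or -1 if no match has been found.
  TokenIndex matched_token = -1;
};

// While lexing, a stack of still-open symbols lets each closing symbol
// be linked back to its opener; recovery from mismatched groups can
// then work directly on these links.
inline void LinkClosingSymbol(std::vector<TokenInfo>& tokens,
                              std::vector<TokenIndex>& open_stack,
                              TokenIndex closing) {
  assert(tokens[closing].is_closing_symbol);
  if (open_stack.empty()) {
    return;  // Unmatched closing symbol: left for recovery to handle.
  }
  TokenIndex opening = open_stack.back();
  open_stack.pop_back();
  tokens[opening].matched_token = closing;
  tokens[closing].matched_token = opening;
}
```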

Folding source location into the token itself simplifies the data
structures significantly and, given the absence of a preprocessor with
token pasting, doesn't lose any fidelity.

The choice of an eager lexer instead of a lazy lexer is designed to
simplify the implementation and testing of the lexer (and
subsequent components). There is no reason to expect Carbon to lex so
many tokens that there are significant locality advantages of lazy
lexing. Moreover, if we want comparable performance benefits, I think
pipelining is a much more promising architecture than laziness. For
now, the simplicity is a huge win.

Being eager also makes it easy for us to use extremely dense memory
encodings for the information about lexed tokens. Everything is
created in a dense array, and small indices are used to identify each
token within the array.
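
As a minimal sketch of what such a dense encoding could look like
(the layout and names here are assumptions, not the real
implementation):

```cpp
// Illustrative sketch only; layout and names are assumptions.
#include <cstdint>
#include <vector>

enum class TokenKind : uint8_t { Identifier, IntegerLiteral, OpenParen /* ... */ };

// Per-token information, including its source location, stored densely.
struct TokenInfo {
  TokenKind kind;
  int32_t byte_offset;  // The token's location within the source buffer.
};

// A token handle is nothing more than a small index into that array.
struct Token {
  int32_t index;
};

class TokenizedBuffer {
 public:
  auto GetKind(Token t) const -> TokenKind { return infos_[t.index].kind; }
  auto GetByteOffset(Token t) const -> int32_t {
    return infos_[t.index].byte_offset;
  }

 private:
  // The whole file's tokens, produced eagerly in a single pass.
  std::vector<TokenInfo> infos_;
};
```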

There is a fuzzer included here that we have run extensively over the
code, but currently toolchain bugs and Bazel limitations prevent it
from easily building. I'm hoping that I or someone else can push on
this soon and enable the fuzzer to at least build if not run fuzz
tests automatically. We have a significant fuzzing corpus that I'll
add in a subsequent commit as well.

This also includes the fuzzer whose commit message was:

Add fuzz testing infrastructure and the lexer's fuzzer. (#21)

This adds a fairly simple `cc_fuzz_test` macro that is specialized for
working with LLVM's LibFuzzer. In addition to building the fuzzer
binary with the toolchain's `fuzzer` feature, it also sets up the test
execution to pass the corpus as file arguments, which is a simple
mechanism to enable regression testing against the fuzz corpus.
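
For a sense of the shape of such a fuzzer, here is a minimal LibFuzzer
entry point sketch; `LexSource` is a hypothetical stand-in for the
real lexing entry point:

```cpp
// Sketch only; `LexSource` is a hypothetical stand-in, not the real API.
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical: run the lexer over the given source text.
void LexSource(const std::string& source);

extern "C" int LLVMFuzzerTestOneInput(const uint8_t* data, size_t size) {
  // Treat the raw bytes as source text; the lexer should tolerate
  // arbitrary input (including invalid UTF-8) without crashing.
  std::string source(reinterpret_cast<const char*>(data), size);
  LexSource(source);
  return 0;  // Non-zero return values are reserved by LibFuzzer.
}
```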

I've included an initial fuzzer corpus as well. To run the fuzzer in
an open-ended fashion, and build up a larger corpus:

```shell
mkdir /tmp/new_corpus
cp lexer/fuzzer_corpus/* /tmp/new_corpus
./bazel-bin/lexer/tokenized_buffer_fuzzer /tmp/new_corpus
```

You can parallelize the fuzzer by adding `-jobs=N` for N threads. For
more details about running fuzzers, see the documentation:
http://llvm.org/docs/LibFuzzer.html

To minimize and merge any interesting new inputs:

```shell
./bazel-bin/lexer/tokenized_buffer_fuzzer -merge=1 \
    lexer/fuzzer_corpus /tmp/new_corpus
```

Co-authored-by: Jon Meow <46229924+jonmeow@users.noreply.github.com>

chandlerc and others added 2 commits December 5, 2020 07:13
Main original commit message:

> Add stubs of a diagnostic emission library. (#13)
>
> This doesn't have much wired up yet, but tries to lay out the most
> primitive API pattern.
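
Purely as an illustration of what a "most primitive" diagnostic
emission API pattern might look like (this sketch is an assumption,
not the actual stubs from #13):

```cpp
// Illustration only; names and structure are assumptions.
#include <string>

// A consumer receives diagnostics; an emitter routes them to it.
class DiagnosticConsumer {
 public:
  virtual ~DiagnosticConsumer() = default;
  virtual void HandleDiagnostic(const std::string& message) = 0;
};

class DiagnosticEmitter {
 public:
  explicit DiagnosticEmitter(DiagnosticConsumer& consumer)
      : consumer_(consumer) {}

  void EmitError(const std::string& message) {
    consumer_.HandleDiagnostic(message);
  }

 private:
  DiagnosticConsumer& consumer_;
};
```
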
The only change here is to update the fuzzer build extension path.

The main original commit message:

> Add an initial lexer. (#17)

Co-authored-by: Jon Meow <46229924+jonmeow@users.noreply.github.com>
@google-cla bot added the "cla: yes" label (PR meets CLA requirements according to bot) Dec 5, 2020
@chandlerc chandlerc changed the base branch from pull-212-merge-toolchain5 to trunk December 8, 2020 09:47
@chandlerc chandlerc merged commit f952552 into carbon-language:trunk Dec 8, 2020
@chandlerc chandlerc deleted the merge-toolchain6 branch December 8, 2020 09:53
chandlerc added a commit that referenced this pull request Jun 28, 2022