Commit

Define a lexical specification interface
nihei9 committed Sep 11, 2021
1 parent 6332aaf commit 96a555a
Showing 6 changed files with 373 additions and 356 deletions.
29 changes: 14 additions & 15 deletions README.md
@@ -47,18 +47,18 @@ If you want to make sure that the lexical specification behaves as expected, you
⚠️ `maleeni lex` and the driver can handle only UTF-8 encoded input.

```sh
-$ echo -n 'The truth is out there.' | maleeni lex clexspec.json | jq -r '[.kind_id, .kind_name, .text, .eof] | @csv'
-2,"word","The",false
-1,"whitespace"," ",false
-2,"word","truth",false
-1,"whitespace"," ",false
-2,"word","is",false
-1,"whitespace"," ",false
-2,"word","out",false
-1,"whitespace"," ",false
-2,"word","there",false
-3,"punctuation",".",false
-0,"","",true
+$ echo -n 'The truth is out there.' | maleeni lex clexspec.json | jq -r '[.kind_name, .lexeme, .eof] | @csv'
+"word","The",false
+"whitespace"," ",false
+"word","truth",false
+"whitespace"," ",false
+"word","is",false
+"whitespace"," ",false
+"word","out",false
+"whitespace"," ",false
+"word","there",false
+"punctuation",".",false
+"","",true
```

The JSON format of tokens that `maleeni lex` command prints is as follows:
@@ -72,8 +72,7 @@ The JSON format of tokens that `maleeni lex` command prints is as follows:
| kind_name | string | A name of a lexical kind. |
| row | integer | A row number where a lexeme appears. |
| col | integer | A column number where a lexeme appears. Note that `col` is counted in code points, not bytes. |
-| match | array of integers | A byte sequence of a lexeme. |
-| text | string | A string representation of a lexeme. |
+| lexeme | array of integers | A byte sequence of a lexeme. |
| eof | bool | When this field is `true`, it means the token is the EOF token. |
| invalid | bool | When this field is `true`, it means the token is an error token. |
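
Since `lexeme` is listed above as an array of integers rather than a string, a consumer has to turn the byte values back into text itself. The following is a minimal sketch of that step, assuming each token arrives as one JSON object. The `Token` struct, `Text`, and `decodeToken` are hypothetical names for illustration and are not part of maleeni; the struct covers only the fields visible in this diff (`mode_name` is taken from the `jq` filter used elsewhere in the README), and the sample JSON line is hand-written.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Token mirrors only the fields visible in the README table and jq filters.
// It is an illustrative type, not the driver's own token type.
type Token struct {
	ModeName string `json:"mode_name"`
	KindName string `json:"kind_name"`
	Row      int    `json:"row"`
	Col      int    `json:"col"`
	Lexeme   []int  `json:"lexeme"` // byte values, one integer per byte
	EOF      bool   `json:"eof"`
	Invalid  bool   `json:"invalid"`
}

// Text converts the lexeme's byte values back into a string.
func (t Token) Text() string {
	b := make([]byte, len(t.Lexeme))
	for i, v := range t.Lexeme {
		b[i] = byte(v)
	}
	return string(b)
}

// decodeToken parses a single JSON object into a Token.
func decodeToken(line string) (Token, error) {
	var t Token
	err := json.Unmarshal([]byte(line), &t)
	return t, err
}

func main() {
	// Hand-written sample; 84, 104, 101 are the UTF-8 bytes of "The".
	tok, err := decodeToken(`{"kind_name":"word","lexeme":[84,104,101],"eof":false,"invalid":false}`)
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	fmt.Println(tok.KindName, tok.Text())
}
```

Decoding into `[]int` and converting explicitly avoids relying on how `encoding/json` treats `[]byte` fields, which it encodes as base64 strings when marshaling.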

@@ -336,7 +335,7 @@ For instance, you can define a subset of [the string literal of golang](https://
In the above specification, when the `"` mark appears in the `default` mode (the initial mode), the driver transitions to the `string` mode and interprets character sequences (`char_seq`) and escape sequences (`escaped_char`). When the `"` mark appears again, the driver returns to the `default` mode.
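
The mode transition described above can be pictured as a small stack of mode names. The sketch below is a toy model under stated assumptions, not the driver's implementation: `modeStack` and `traceModes` are hypothetical, and the kind name `string_close` is assumed for the closing `"` mark, which is not visible in this excerpt. It records the mode in which each token kind is recognized, so the opening mark is still reported in `default` mode, as in the sample output.

```go
package main

import "fmt"

// modeStack is a hypothetical stand-in for the driver's mode tracking.
type modeStack []string

func (s *modeStack) push(mode string) { *s = append(*s, mode) }

// pop leaves the initial mode in place so the stack is never empty.
func (s *modeStack) pop() {
	if len(*s) > 1 {
		*s = (*s)[:len(*s)-1]
	}
}

func (s modeStack) top() string { return s[len(s)-1] }

// traceModes returns, for each token kind, the mode that was active when
// the token was recognized. "string_close" is an assumed kind name.
func traceModes(kinds []string) []string {
	stack := modeStack{"default"}
	modes := make([]string, 0, len(kinds))
	for _, kind := range kinds {
		modes = append(modes, stack.top())
		switch kind {
		case "string_open":
			stack.push("string")
		case "string_close":
			stack.pop()
		}
	}
	return modes
}

func main() {
	kinds := []string{"string_open", "char_seq", "escaped_char", "char_seq", "string_close"}
	for i, mode := range traceModes(kinds) {
		fmt.Printf("%s,%s\n", mode, kinds[i])
	}
}
```

The transition takes effect only after the triggering token is emitted, which is why the opening mark belongs to `default` and the closing mark to `string`.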

```sh
-$ echo -n '"foo\nbar"foo' | maleeni lex go-string-cspec.json | jq -r '[.mode_name, .kind_name, .text, .eof] | @csv'
+$ echo -n '"foo\nbar"foo' | maleeni lex go-string-cspec.json | jq -r '[.mode_name, .kind_name, .lexeme, .eof] | @csv'
"default","string_open","""",false
"string","char_seq","foo",false
"string","escaped_char","\n",false
2 changes: 1 addition & 1 deletion cmd/maleeni/lex.go
@@ -53,7 +53,7 @@ func runLex(cmd *cobra.Command, args []string) (retErr error) {
defer f.Close()
src = f
}
-lex, err = driver.NewLexer(clspec, src)
+lex, err = driver.NewLexer(driver.NewLexSpec(clspec), src)
if err != nil {
return err
}
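
The change above wraps the compiled specification in `driver.NewLexSpec` before handing it to the lexer, which suggests the lexer now depends on an interface rather than on the compiled data directly. The driver's actual interface is not shown in this excerpt, so the following is only a rough sketch of that adapter pattern. Every name in it (`compiledSpec`, `lexSpec`, `specAdapter`, `newLexSpec`, `newLexer`, `describe`) is hypothetical, and the single `KindName` method is an invented example rather than the real method set.

```go
package main

import "fmt"

// compiledSpec stands in for the compiled specification (clspec).
// Hypothetical: the real structure is not shown in this diff.
type compiledSpec struct {
	kindNames []string
}

// lexSpec is an invented, minimal interface used only to show the pattern.
type lexSpec interface {
	KindName(id int) (string, bool)
}

// specAdapter makes a compiledSpec usable wherever a lexSpec is expected.
type specAdapter struct {
	spec *compiledSpec
}

func newLexSpec(spec *compiledSpec) lexSpec {
	return &specAdapter{spec: spec}
}

func (a *specAdapter) KindName(id int) (string, bool) {
	if id < 0 || id >= len(a.spec.kindNames) {
		return "", false
	}
	return a.spec.kindNames[id], true
}

// lexer depends only on the interface, not on the compiled data layout.
type lexer struct {
	spec lexSpec
}

func newLexer(spec lexSpec) *lexer {
	return &lexer{spec: spec}
}

func (l *lexer) describe(id int) string {
	name, ok := l.spec.KindName(id)
	if !ok {
		return "unknown kind"
	}
	return name
}

func main() {
	// Kind IDs follow the pre-change README output: 1 whitespace, 2 word, 3 punctuation.
	clspec := &compiledSpec{kindNames: []string{"", "whitespace", "word", "punctuation"}}
	lex := newLexer(newLexSpec(clspec))
	fmt.Println(lex.describe(2))
}
```

With this shape, a different backing representation only needs its own adapter; the lexer code stays unchanged.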