maleeni is a lexer generator for golang. maleeni also provides a command to perform lexical analysis to allow easy debugging of your lexical specification.
Compiler:
$ go install github.com/nihei9/maleeni/cmd/maleeni@latest
Code Generator:
$ go install github.com/nihei9/maleeni/cmd/maleeni-go@latest
First, define your lexical specification in JSON format. As an example, let's write the definitions of whitespace, words, and punctuation.
{
"name": "statement",
"entries": [
{
"kind": "whitespace",
"pattern": "[\\u{0009}\\u{000A}\\u{000D}\\u{0020}]+"
},
{
"kind": "word",
"pattern": "[0-9A-Za-z]+"
},
{
"kind": "punctuation",
"pattern": "[.,:;]"
}
]
}
Save the above specification to a file. In this explanation, the file name is statement.json
.
Next, generate a DFA from the lexical specification using maleeni compile
command.
$ maleeni compile statement.json -o statementc.json
If you want to make sure that the lexical specification behaves as expected, you can use maleeni lex
command to try lexical analysis without having to generate a lexer. maleeni lex
command outputs tokens in JSON format. For simplicity, print significant fields of the tokens in CSV format using jq command.
maleeni lex
and the driver can handle is only UTF-8.
$ echo -n 'The truth is out there.' | maleeni lex statementc.json | jq -r '[.kind_name, .lexeme, .eof] | @csv'
"word","The",false
"whitespace"," ",false
"word","truth",false
"whitespace"," ",false
"word","is",false
"whitespace"," ",false
"word","out",false
"whitespace"," ",false
"word","there",false
"punctuation",".",false
"","",true
The JSON format of tokens that maleeni lex
command prints is as follows:
Field | Type | Description |
---|---|---|
mode_id | integer | An ID of a lex mode. |
mode_name | string | A name of a lex mode. |
kind_id | integer | An ID of a kind. This is unique among all modes. |
mode_kind_id | integer | An ID of a lexical kind. This is unique only within a mode. Note that you need to use kind_id field if you want to identify a kind across all modes. |
kind_name | string | A name of a lexical kind. |
row | integer | A row number where a lexeme appears. |
col | integer | A column number where a lexeme appears. Note that col is counted in code points, not bytes. |
lexeme | array of integers | A byte sequense of a lexeme. |
eof | bool | When this field is true , it means the token is the EOF token. |
invalid | bool | When this field is true , it means the token is an error token. |
Using maleeni-go
command, you can generate a source code of the lexer to recognize your lexical specification.
$ maleeni-go statementc.json
The above command generates the lexer and saves it to statement_lexer.go
file. By default, the file name will be {spec name}_lexer.json
. To use the lexer, you need to call NewLexer
function defined in statement_lexer.go
. The following code is a simple example. In this example, the lexer reads a source code from stdin and writes the result, tokens, to stdout.
package main
import (
"fmt"
"os"
)
func main() {
lex, err := NewLexer(NewLexSpec(), os.Stdin)
if err != nil {
fmt.Fprintln(os.Stderr, err)
os.Exit(1)
}
for {
tok, err := lex.Next()
if err != nil {
fmt.Fprintln(os.Stderr, err)
os.Exit(1)
}
if tok.EOF {
break
}
if tok.Invalid {
fmt.Printf("invalid: %#v\n", string(tok.Lexeme))
} else {
fmt.Printf("valid: %v: %#v\n", KindIDToName(tok.KindID), string(tok.Lexeme))
}
}
}
Please save the above source code to main.go
and create a directory structure like the one below.
/project_root
├── statement_lexer.go ... Lexer generated from the compiled lexical specification (the result of `maleeni-go`).
└── main.go .............. Caller of the lexer.
Now, you can perform the lexical analysis.
$ echo -n 'I want to believe.' | go run main.go statement_lexer.go
valid: word: "I"
valid: whitespace: " "
valid: word: "want"
valid: whitespace: " "
valid: word: "to"
valid: whitespace: " "
valid: word: "believe"
valid: punctuation: "."
See also this example.
The lexical specification format to be passed to maleeni compile
command is as follows:
top level object:
Field | Type | Domain | Nullable | Description |
---|---|---|---|---|
name | string | id | false | A specification name. |
entries | array of entry objects | N/A | false | An array of entries sorted by priority. The first element has the highest priority, and the last has the lowest priority. |
entry object:
Field | Type | Domain | Nullable | Description |
---|---|---|---|---|
kind | string | id | false | A name of a token kind. The name must be unique, but duplicate names between fragments and non-fragments are allowed. |
pattern | string | regexp | false | A pattern in a regular expression |
modes | array of strings | N/A | true | Mode names that an entry is enabled in (default: "default") |
push | string | id | true | A mode name that the lexer pushes to own mode stack when a token matching the pattern appears |
pop | bool | N/A | true | When pop is true , the lexer pops a mode from own mode stack. |
fragment | bool | N/A | true | When fragment is true , its entry is a fragment. |
See Identifier and Regular Expression for more details on id
domain and regexp
domain.
id
represents an identifier and must follow the rules below:
id
must be a lower snake case. It can contain onlya
toz
,0
to9
, and_
.- The first and last characters must be one of
a
toz
. _
cannot appear consecutively.
regexp
represents a regular expression. Its syntax is below:
\
as \\
.
Concatenation and alternation allow you to combine multiple characters or multiple patterns into one pattern.
Pattern | Matches |
---|---|
abc |
abc |
abc|def |
abc or def |
In addition to using ordinary characters, there are other ways to represent a single character:
- dot expression
- bracket expressions
- code point expressions
- character property expressions
- escape sequences
The dot expression matches any one chracter.
Pattern | Matches |
---|---|
. |
any one character |
The bracket expressions are represented by enclosing characters in [ ]
or [^ ]
. [^ ]
is negation of [ ]
. For instance, [ab]
matches one of a
or b
, and [^ab]
matches any one character except a
and b
.
Pattern | Matches |
---|---|
[abc] |
a , b , or c |
[^abc] |
any one character except a , b , and c |
[a-z] |
one in the range of a to z |
[a-] |
a or - |
[-z] |
- or z |
[-] |
- |
[^a-z] |
any one character except the range of a to z |
[a^] |
a or ^ |
The code point expressions match a character that has a specified code point. The code points consists of a four or six digits hex string.
Pattern | Matches |
---|---|
\u{000A} |
U+000A (LF) |
\u{3042} |
U+3042 (hiragana あ ) |
\u{01F63A} |
U+1F63A (grinning cat 😺 ) |
The character property expressions match a character that has a specified character property of the Unicode. Currently, maleeni supports General_Category
, Script
, Alphabetic
, Lowercase
, Uppercase
, and White_Space
. When you omitted the equal symbol and a right-side value, maleeni interprets a symbol in \p{...}
as the General_Category
value.
Pattern | Matches |
---|---|
\p{General_Category=Letter} |
any one character whose General_Category is Letter |
\p{gc=Letter} |
the same as \p{General_Category=Letter} |
\p{Letter} |
the same as \p{General_Category=Letter} |
\p{l} |
the same as \p{General_Category=Letter} |
\p{Script=Latin} |
any one character whose Script is Latin |
\p{Alphabetic=yes} |
any one character whose Alphabetic is yes |
\p{Lowercase=yes} |
any one character whose Lowercase is yes |
\p{Uppercase=yes} |
any one character whose Uppercase is yes |
\p{White_Space=yes} |
any one character whose White_Space is yes |
As you escape the special character with \
, you can write a rule that matches the special character itself.
The following escape sequences are available outside of bracket expressions.
Pattern | Matches |
---|---|
\\. |
. |
\\? |
? |
\\* |
* |
\\+ |
+ |
\\( |
( |
\\) |
) |
\\[ |
[ |
\\| |
| |
\\\\ |
\\ |
The following escape sequences are available inside bracket expressions.
Pattern | Matches |
---|---|
\\^ |
^ |
\\- |
- |
\\] |
] |
The repetitions match a string that repeats the previous single character or group.
Pattern | Matches |
---|---|
a* |
zero or more a |
a+ |
one or more a |
a? |
zero or one a |
(
and )
groups any patterns.
Pattern | Matches |
---|---|
a(bc)*d |
ad , abcd , abcbcd , and so on |
(ab|cd)+ |
ab , cd , abcd , cdab , abcdab , and so on |
The fragment is a feature that allows you to define a part of a pattern. This feature is useful for decomposing complex patterns into simple patterns and for defining common parts between patterns.
A fragment entry is defined by an entry whose fragment
field is true
, and is referenced by a fragment expression (\f{...}
).
Fragment patterns can be nested, but they are not allowed to contain circular references.
For instance, you can define an identifier of golang as follows:
{
"name": "id",
"entries": [
{
"fragment": true,
"kind": "unicode_letter",
"pattern": "\\p{Letter}"
},
{
"fragment": true,
"kind": "unicode_digit",
"pattern": "\\p{Number}"
},
{
"fragment": true,
"kind": "letter",
"pattern": "\\f{unicode_letter}|_"
},
{
"kind": "identifier",
"pattern": "\\f{letter}(\\f{letter}|\\f{unicode_digit})*"
}
]
}
Lexical specifications and source files to be analyzed cannot contain the following code points.
When you write a pattern that implicitly contains the unavailable code points, maleeni will automatically generate a pattern that doesn't contain the unavailable code points and replaces the original pattern. However, when you explicitly use the unavailable code points (like \u{U+D800}
or \p{General_Category=Cs}
), maleeni will occur an error.
- surrogate code points: U+D800..U+DFFF
Lex Mode is a feature that allows you to separate a DFA transition table for each mode.
modes
field of an entry in a lexical specification indicates in which mode the entry is enabled. If modes
field is empty, the entry is enabled only in the default mode. The compiler groups the entries and generates a DFA for each mode. Thus the driver can switch the transition table by switching modes. The mode switching follows push
or pop
field of each entry.
For instance, you can define a subset of the string literal of golang as follows:
{
"name": "string",
"entries": [
{
"kind": "string_open",
"pattern": "\"",
"push": "string"
},
{
"modes": ["string"],
"kind": "char_seq",
"pattern": "[^\\u{000A}\"\\\\]+"
},
{
"modes": ["string"],
"kind": "escaped_char",
"pattern": "\\\\[abfnrtv\\\\'\"]"
},
{
"modes": ["string"],
"kind": "escape_symbol",
"pattern": "\\\\"
},
{
"modes": ["string"],
"kind": "newline",
"pattern": "\\u{000A}"
},
{
"modes": ["string"],
"kind": "string_close",
"pattern": "\"",
"pop": true
},
{
"kind": "identifier",
"pattern": "[A-Za-z_][0-9A-Za-z_]*"
}
]
}
In the above specification, when the "
mark appears in default mode (it's the initial mode), the driver transitions to the string
mode and interprets character sequences (char_seq
) and escape sequences (escaped_char
). When the "
mark appears the next time, the driver returns to the default
mode.
$ echo -n '"foo\nbar"foo' | maleeni lex stringc.json | jq -r '[.mode_name, .kind_name, .lexeme, .eof] | @csv'
"default","string_open","""",false
"string","char_seq","foo",false
"string","escaped_char","\n",false
"string","char_seq","bar",false
"string","string_close","""",false
"default","identifier","foo",false
"default","","",true
The input string enclosed in the "
mark (foo\nbar
) are interpreted as the char_seq
and the escaped_char
, while the outer string (foo
) is interpreted as the identifier
. The same string foo
is interpreted as different types because of the different modes in which they are interpreted.
maleeni references Unicode 13.0.0.