Skip to content

Latest commit

 

History

History
427 lines (330 loc) · 17.6 KB

README.md

File metadata and controls

427 lines (330 loc) · 17.6 KB

maleeni

maleeni is a lexer generator for golang. maleeni also provides a command to perform lexical analysis to allow easy debugging of your lexical specification.

ci

Installation

Compiler:

$ go install github.com/nihei9/maleeni/cmd/maleeni@latest

Code Generator:

$ go install github.com/nihei9/maleeni/cmd/maleeni-go@latest

Usage

1. Define your lexical specification

First, define your lexical specification in JSON format. As an example, let's write the definitions of whitespace, words, and punctuation.

{
    "name": "statement",
    "entries": [
        {
            "kind": "whitespace",
            "pattern": "[\\u{0009}\\u{000A}\\u{000D}\\u{0020}]+"
        },
        {
            "kind": "word",
            "pattern": "[0-9A-Za-z]+"
        },
        {
            "kind": "punctuation",
            "pattern": "[.,:;]"
        }
    ]
}

Save the above specification to a file. In this explanation, the file name is statement.json.

⚠️ The input file must be encoded in UTF-8.

2. Compile the lexical specification

Next, generate a DFA from the lexical specification using maleeni compile command.

$ maleeni compile statement.json -o statementc.json

3. Debug (Optional)

If you want to make sure that the lexical specification behaves as expected, you can use maleeni lex command to try lexical analysis without having to generate a lexer. maleeni lex command outputs tokens in JSON format. For simplicity, print significant fields of the tokens in CSV format using jq command.

⚠️ An encoding that maleeni lex and the driver can handle is only UTF-8.

$ echo -n 'The truth is out there.' | maleeni lex statementc.json | jq -r '[.kind_name, .lexeme, .eof] | @csv'
"word","The",false
"whitespace"," ",false
"word","truth",false
"whitespace"," ",false
"word","is",false
"whitespace"," ",false
"word","out",false
"whitespace"," ",false
"word","there",false
"punctuation",".",false
"","",true

The JSON format of tokens that maleeni lex command prints is as follows:

Field Type Description
mode_id integer An ID of a lex mode.
mode_name string A name of a lex mode.
kind_id integer An ID of a kind. This is unique among all modes.
mode_kind_id integer An ID of a lexical kind. This is unique only within a mode. Note that you need to use kind_id field if you want to identify a kind across all modes.
kind_name string A name of a lexical kind.
row integer A row number where a lexeme appears.
col integer A column number where a lexeme appears. Note that col is counted in code points, not bytes.
lexeme array of integers A byte sequense of a lexeme.
eof bool When this field is true, it means the token is the EOF token.
invalid bool When this field is true, it means the token is an error token.

4. Generate the lexer

Using maleeni-go command, you can generate a source code of the lexer to recognize your lexical specification.

$ maleeni-go statementc.json

The above command generates the lexer and saves it to statement_lexer.go file. By default, the file name will be {spec name}_lexer.json. To use the lexer, you need to call NewLexer function defined in statement_lexer.go. The following code is a simple example. In this example, the lexer reads a source code from stdin and writes the result, tokens, to stdout.

package main

import (
    "fmt"
    "os"
)

func main() {
    lex, err := NewLexer(NewLexSpec(), os.Stdin)
    if err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }

    for {
        tok, err := lex.Next()
        if err != nil {
            fmt.Fprintln(os.Stderr, err)
            os.Exit(1)
        }
        if tok.EOF {
            break
        }
        if tok.Invalid {
            fmt.Printf("invalid: %#v\n", string(tok.Lexeme))
        } else {
            fmt.Printf("valid: %v: %#v\n", KindIDToName(tok.KindID), string(tok.Lexeme))
        }
    }
}

Please save the above source code to main.go and create a directory structure like the one below.

/project_root
├── statement_lexer.go ... Lexer generated from the compiled lexical specification (the result of `maleeni-go`).
└── main.go .............. Caller of the lexer.

Now, you can perform the lexical analysis.

$ echo -n 'I want to believe.' | go run main.go statement_lexer.go
valid: word: "I"
valid: whitespace: " "
valid: word: "want"
valid: whitespace: " "
valid: word: "to"
valid: whitespace: " "
valid: word: "believe"
valid: punctuation: "."

More Practical Usage

See also this example.

Lexical Specification Format

The lexical specification format to be passed to maleeni compile command is as follows:

top level object:

Field Type Domain Nullable Description
name string id false A specification name.
entries array of entry objects N/A false An array of entries sorted by priority. The first element has the highest priority, and the last has the lowest priority.

entry object:

Field Type Domain Nullable Description
kind string id false A name of a token kind. The name must be unique, but duplicate names between fragments and non-fragments are allowed.
pattern string regexp false A pattern in a regular expression
modes array of strings N/A true Mode names that an entry is enabled in (default: "default")
push string id true A mode name that the lexer pushes to own mode stack when a token matching the pattern appears
pop bool N/A true When pop is true, the lexer pops a mode from own mode stack.
fragment bool N/A true When fragment is true, its entry is a fragment.

See Identifier and Regular Expression for more details on id domain and regexp domain.

Identifier

id represents an identifier and must follow the rules below:

  • id must be a lower snake case. It can contain only a to z, 0 to 9, and _.
  • The first and last characters must be one of a to z.
  • _ cannot appear consecutively.

Regular Expression

regexp represents a regular expression. Its syntax is below:

⚠️ In JSON, you need to write \ as \\.

⚠️ maleeni doesn't allow you to use some code points. See Unavailable Code Points.

Composites

Concatenation and alternation allow you to combine multiple characters or multiple patterns into one pattern.

Pattern Matches
abc abc
abc|def abc or def

Single Characters

In addition to using ordinary characters, there are other ways to represent a single character:

  • dot expression
  • bracket expressions
  • code point expressions
  • character property expressions
  • escape sequences

Dot Expression

The dot expression matches any one chracter.

Pattern Matches
. any one character

Bracket Expressions

The bracket expressions are represented by enclosing characters in [ ] or [^ ]. [^ ] is negation of [ ]. For instance, [ab] matches one of a or b, and [^ab] matches any one character except a and b.

Pattern Matches
[abc] a, b, or c
[^abc] any one character except a, b, and c
[a-z] one in the range of a to z
[a-] a or -
[-z] - or z
[-] -
[^a-z] any one character except the range of a to z
[a^] a or ^

Code Point Expressions

The code point expressions match a character that has a specified code point. The code points consists of a four or six digits hex string.

Pattern Matches
\u{000A} U+000A (LF)
\u{3042} U+3042 (hiragana )
\u{01F63A} U+1F63A (grinning cat 😺)

Character Property Expressions

The character property expressions match a character that has a specified character property of the Unicode. Currently, maleeni supports General_Category, Script, Alphabetic, Lowercase, Uppercase, and White_Space. When you omitted the equal symbol and a right-side value, maleeni interprets a symbol in \p{...} as the General_Category value.

Pattern Matches
\p{General_Category=Letter} any one character whose General_Category is Letter
\p{gc=Letter} the same as \p{General_Category=Letter}
\p{Letter} the same as \p{General_Category=Letter}
\p{l} the same as \p{General_Category=Letter}
\p{Script=Latin} any one character whose Script is Latin
\p{Alphabetic=yes} any one character whose Alphabetic is yes
\p{Lowercase=yes} any one character whose Lowercase is yes
\p{Uppercase=yes} any one character whose Uppercase is yes
\p{White_Space=yes} any one character whose White_Space is yes

Escape Sequences

As you escape the special character with \, you can write a rule that matches the special character itself. The following escape sequences are available outside of bracket expressions.

Pattern Matches
\\. .
\\? ?
\\* *
\\+ +
\\( (
\\) )
\\[ [
\\| |
\\\\ \\

The following escape sequences are available inside bracket expressions.

Pattern Matches
\\^ ^
\\- -
\\] ]

Repetitions

The repetitions match a string that repeats the previous single character or group.

Pattern Matches
a* zero or more a
a+ one or more a
a? zero or one a

Grouping

( and ) groups any patterns.

Pattern Matches
a(bc)*d ad, abcd, abcbcd, and so on
(ab|cd)+ ab, cd, abcd, cdab, abcdab, and so on

Fragment

The fragment is a feature that allows you to define a part of a pattern. This feature is useful for decomposing complex patterns into simple patterns and for defining common parts between patterns. A fragment entry is defined by an entry whose fragment field is true, and is referenced by a fragment expression (\f{...}). Fragment patterns can be nested, but they are not allowed to contain circular references.

For instance, you can define an identifier of golang as follows:

{
    "name": "id",
    "entries": [
        {
            "fragment": true,
            "kind": "unicode_letter",
            "pattern": "\\p{Letter}"
        },
        {
            "fragment": true,
            "kind": "unicode_digit",
            "pattern": "\\p{Number}"
        },
        {
            "fragment": true,
            "kind": "letter",
            "pattern": "\\f{unicode_letter}|_"
        },
        {
            "kind": "identifier",
            "pattern": "\\f{letter}(\\f{letter}|\\f{unicode_digit})*"
        }
    ]
}

Unavailable Code Points

Lexical specifications and source files to be analyzed cannot contain the following code points.

When you write a pattern that implicitly contains the unavailable code points, maleeni will automatically generate a pattern that doesn't contain the unavailable code points and replaces the original pattern. However, when you explicitly use the unavailable code points (like \u{U+D800} or \p{General_Category=Cs}), maleeni will occur an error.

  • surrogate code points: U+D800..U+DFFF

Lex Mode

Lex Mode is a feature that allows you to separate a DFA transition table for each mode.

modes field of an entry in a lexical specification indicates in which mode the entry is enabled. If modes field is empty, the entry is enabled only in the default mode. The compiler groups the entries and generates a DFA for each mode. Thus the driver can switch the transition table by switching modes. The mode switching follows push or pop field of each entry.

For instance, you can define a subset of the string literal of golang as follows:

{
    "name": "string",
    "entries": [
        {
            "kind": "string_open",
            "pattern": "\"",
            "push": "string"
        },
        {
            "modes": ["string"],
            "kind": "char_seq",
            "pattern": "[^\\u{000A}\"\\\\]+"
        },
        {
            "modes": ["string"],
            "kind": "escaped_char",
            "pattern": "\\\\[abfnrtv\\\\'\"]"
        },
        {
            "modes": ["string"],
            "kind": "escape_symbol",
            "pattern": "\\\\"
        },
        {
            "modes": ["string"],
            "kind": "newline",
            "pattern": "\\u{000A}"
        },
        {
            "modes": ["string"],
            "kind": "string_close",
            "pattern": "\"",
            "pop": true
        },
        {
            "kind": "identifier",
            "pattern": "[A-Za-z_][0-9A-Za-z_]*"
        }
    ]
}

In the above specification, when the " mark appears in default mode (it's the initial mode), the driver transitions to the string mode and interprets character sequences (char_seq) and escape sequences (escaped_char). When the " mark appears the next time, the driver returns to the default mode.

$ echo -n '"foo\nbar"foo' | maleeni lex stringc.json | jq -r '[.mode_name, .kind_name, .lexeme, .eof] | @csv'
"default","string_open","""",false
"string","char_seq","foo",false
"string","escaped_char","\n",false
"string","char_seq","bar",false
"string","string_close","""",false
"default","identifier","foo",false
"default","","",true

The input string enclosed in the " mark (foo\nbar) are interpreted as the char_seq and the escaped_char, while the outer string (foo) is interpreted as the identifier. The same string foo is interpreted as different types because of the different modes in which they are interpreted.

Unicode Version

maleeni references Unicode 13.0.0.