Flexible String Tokenization Library for JavaScript
Tokenizr is a small JavaScript library, providing powerful and flexible string tokenization functionality. It is intended to be be used as the underlying "lexical scanner" in a Recursive Descent based "syntax parser", but can be used for other parsing purposes, too. Its distinct features are:
-
Efficient Iteration:
It iterates over the input character string in a read-only and copy-less fashion. -
Stacked States:
Its tokenization is based on stacked states for determining rules which can be applied. Each rule can be enabled for one or more particular states only. -
Regular Expression Matching:
Its tokenization is based on powerful Regular Expressions for matching the input string. -
Match Repeating:
Rule actions can change the state and then force the repeating of the matching process from scratch at the current input position. -
Match Rejecting:
Rule actions can reject their matching at the current input position and let subsequent rules to still match. -
Match Ignoring:
Rule actions can force the matched input to be ignored (without generating a token at all). -
Match Accepting:
Rule actions can accept the matched input and provide one or even more resulting tokens. -
Shared Context Data:
Rule actions (during tokenization) can optionally store and retrieve arbitrary values to/from their tokenization context to share data between rules. -
Token Text and Value:
Tokens provide information about their matched input text and can provide a different corresponding (pre-parsed) value, too. -
Debug Mode:
The tokenization process can be debugged through optional detailed logging of the internal processing. -
Nestable Transactions:
The tokenization can be split into distinct (and nestable) transactions which can be committed or rolled back. This way the tokenization can be incrementally stepped back and this way support the attempt of parsing alternatives. -
Token Look-Ahead:
The forthcoming tokens can be inspected, to support alternative decisions from within the parser, based on look-ahead tokens.
$ npm install tokenizr
$ bower install tokenizr
Suppose we have a configuration file sample.cfg
:
foo {
baz = 1 // sample comment
bar {
quux = 42
hello = "hello \"world\"!"
}
quux = 7
}
Then we can write a lexical scanner in ECMAScript 6 (under Node.js) for the tokens like this:
import fs from "fs"
import Tokenizr from "tokenizr"
let lexer = new Tokenizr()
lexer.rule(/[a-zA-Z_][a-zA-Z0-9_]*/, (ctx, match) => {
ctx.accept("id")
})
lexer.rule(/[+-]?[0-9]+/, (ctx, match) => {
ctx.accept("number", parseInt(match[0]))
})
lexer.rule(/"((?:\\"|[^\r\n])*)"/, (ctx, match) => {
ctx.accept("string", match[1].replace(/\\"/g, "\""))
})
lexer.rule(/\/\/[^\r\n]*\r?\n/, (ctx, match) => {
ctx.ignore()
})
lexer.rule(/[ \t\r\n]+/, (ctx, match) => {
ctx.ignore()
})
lexer.rule(/./, (ctx, match) => {
ctx.accept("char")
})
let cfg = fs.readFileSync("sample.cfg", "utf8")
lexer.input(cfg)
lexer.debug(true)
lexer.tokens().forEach((token) => {
console.log(token.toString())
})
The output of running this sample program is:
<type: id, value: "foo", text: "foo", pos: 0, line: 1, column: 1>
<type: char, value: "{", text: "{", pos: 4, line: 1, column: 5>
<type: id, value: "baz", text: "baz", pos: 10, line: 2, column: 5>
<type: char, value: "=", text: "=", pos: 14, line: 2, column: 9>
<type: number, value: 1, text: "1", pos: 16, line: 2, column: 11>
<type: id, value: "bar", text: "bar", pos: 40, line: 3, column: 5>
<type: char, value: "{", text: "{", pos: 44, line: 3, column: 9>
<type: id, value: "quux", text: "quux", pos: 54, line: 4, column: 9>
<type: char, value: "=", text: "=", pos: 59, line: 4, column: 14>
<type: number, value: 42, text: "42", pos: 61, line: 4, column: 16>
<type: id, value: "hello", text: "hello", pos: 72, line: 5, column: 9>
<type: char, value: "=", text: "=", pos: 78, line: 5, column: 15>
<type: string, value: "hello \"world\"!", text: "\"hello \\\"world\\\"!\"", pos: 80, line: 5, column: 17>
<type: char, value: "}", text: "}", pos: 103, line: 6, column: 5>
<type: id, value: "quux", text: "quux", pos: 109, line: 7, column: 5>
<type: char, value: "=", text: "=", pos: 114, line: 7, column: 10>
<type: number, value: 7, text: "7", pos: 116, line: 7, column: 12>
<type: char, value: "}", text: "}", pos: 118, line: 8, column: 1>
<type: EOF, value: "", text: "", pos: 122, line: 9, column: 1>
If you want to combine multiple single-char plaintext tokens into a multi-char plaintext token, you can use the following code fragment:
let plaintext = ""
lexer.before((ctx, match, rule) => {
if (rule.name !== "plaintext" && plaintext !== "") {
ctx.accept("plaintext", plaintext)
plaintext = ""
}
})
lexer.rule(/./, (ctx, match) => {
plaintext += match[0]
ctx.ignore()
}, "plaintext")
lexer.finish((ctx) => {
if (plaintext !== "")
ctx.accept("plaintext", plaintext)
})
With the additional help of an Abstract Syntax Tree (AST) library like ASTy and a query library like ASTq you can write powerful Recursive Descent based parsers which parse such a token stream into an AST and then query and process the AST.
This is the main API class for establishing a lexical scanner.
-
Constructor:
Tokenizr(): Tokenizr
Create a new tokenization instance. -
Method:
Tokenizr#reset(): Tokenizr
Reset the tokenization instance to a fresh one by discarding all internal state information. -
Method:
Tokenizr#debug(enable: Boolean): Tokenizr
Enable (or disable) verbose logging for debugging purposes. -
Method:
Tokenizr#input(input: String): Tokenizr
Set the input string to tokenize. This implicitly performs areset()
operation beforehand. -
Method:
Tokenizr#push(state: String): Tokenizr
Push a state onto the state stack. -
Method:
Tokenizr#pop(): String
Pop a state from the state stack. The initial (aka first or lowest) stack value (default
) cannot be popped. -
Method:
Tokenizr#state(state: String): Tokenizr
Method:Tokenizr#state(): String
Set or get the state on the top of the state stack. Use this to initialy start tokenizing with a custom state. The initial state is nameddefault
. -
Method:
Tokenizr#tag(tag: String): Tokenizr
Set a tag. The tag has to be matched by rules. -
Method:
Tokenizr#tagged(tag: String): Boolean
Check whether a particular tag is set. -
Method:
Tokenizr#untag(tag: String): Tokenizr
Unset a particular tag. The tag no longer will be matched by rules. -
Method:
Tokenizr#before(action: (ctx: ActionContext, match: Array[String], rule: { state: String, pattern: RegExp, action: Function, name: String }) => Void): Tokenizr
Configure a single action which is called directly before any rule action (configured withTokenizr#rule()
) is called. This can be used to execute a common action just before all rule actions. Therule
argument is theTokenizr#rule()
information of the particular rule which is executed. -
Method:
Tokenizr#after(action: (ctx: ActionContext, match: Array[String], rule: { state: String, pattern: RegExp, action: Function, name: String }) => Void): Tokenizr
Configure a single action which is called directly after any rule action (configured withTokenizr#rule()
) is called. This can be used to execute a common action just after all rule actions. Therule
argument is theTokenizr#rule()
information of the particular rule which is executed. -
Method:
Tokenizr#finish(action: (ctx: ActionContext) => Void): Tokenizr
Configure a single action which is called directly before anEOF
token is emitted. This can be used to execute a common action just after the last rule action was called. -
Method:
Tokenizr#rule(state?: String, pattern: RegExp, action: (ctx: ActionContext, match: Array[String]) => Void, name: String): Tokenizr
Configure a token matching rule namedname
, which executes itsaction
in case the current tokenization state is one of the states (and all of the currently set tags) instate
(by default the rule matches all states ifstate
is not specified) and the next input characters match against thepattern
. The exact syntax ofstate
is<state>[ #<tag> #<tag> ...][, <state>[ #<tag> #<tag> ...], ...]
, i.e., it is one or more comma-separated state matches (OR-combined) and each state match has exactly one state and zero or more space-separated tags (AND-combined). Thectx
argument provides a context object for token repeating/rejecting/ignoring/accepting, thematch
argument is the result of the underlyingRegExp#exec
call. -
Method:
Tokenizr#token(): Tokenizr.Token
Get the next token from the input. Internally, the current position of the input is matched against the patterns of all rules (in rule configuration order). The first rule action which accepts the matching leads to the token. -
Method:
Tokenizr#tokens(): Array[Tokenizr.Token]
Tokenizes the entire input and returns all the corresponding tokens. This is a convenience method only. Usually one takes just single tokens at a time withTokenizr#token()
. -
Method:
Tokenizr#skip(next?: Number): Tokenizr
Get and discard thenext
number of following tokens withTokenizr#token()
. -
Method:
Tokenizr#consume(type: String, value?: String): Tokenizr.Token
Match (withTokenizr.Token#isA
) the next token. If it matchestype
and optionally alsovalue
, consume it. If it does not match, throw aTokenizr.ParsingError
. This is the primary function used in Recursive Descent parsers. -
Method:
Tokenizr#peek(offset?: Number): Tokenizr.Token
Peek at the following token at the (0-based) offset without consuming the token. This is the secondary function used in Recursive Descent parsers. -
Method:
Tokenizr#begin(): Tokenizr
Begin a transaction. UntilTokenizr#commit()
orTokenizr#rollback()
are called, all consumed tokens will be internally remembered and be either thrown away (onTokenizr#commit()
) or pushed back (onTokenizr#rollback()
). This can be used multiple times and this way supports nested transactions. It is intended to be used for tokenizing alternatives. -
Method:
Tokenizr#depth(): Number
Return the number of already consumed tokens in the currently active transaction. This is useful if multiple alternatives are parsed and in case all failed, to report the error for the most specific one, i.e., the one which consumed most tokens. -
Method:
Tokenizr#commit(): Tokenizr
End a transaction successfully. All consumed tokens are finally gone. -
Method:
Tokenizr#rollback(): Tokenizr
End a transaction unsuccessfully. All consumed tokens are pushed back and can be consumed again. -
Method:
Tokenizr#alternatives(...alternatives: Array[() => any]): any
Utility method for parsing alternatives. It internally executes the supplied callback functions in sequence, each wrapped into its own transaction. The first one which succeeds (does not throw an exception and returns a value) leads to the successful result. In case all alternatives failed (all throw an exception), the exception of the most-specific alterative (the one with the largest transaction depth) is re-thrown. Thethis
in each callback function points to theTokenizr
object on whichalternatives
was called. -
Method:
Tokenizr#error(message: String): Tokenizr.ParsingError
Returns a new instance ofTokenizr.ParsingError
, based on the current input character stream position, and withTokenizr.ParsingError#message
set tomessage
.
This is the class of all returned tokens.
-
Property:
Tokenizr.Token#type: String
The type of the token as specified onTokenizr.ActionContext#accept()
. -
Property:
Tokenizr.Token#value: any
The value of the token. By default this is the same asTokenizr.Token#text
, but can be any pre-processed value as specified onTokenizr.ActionContext#accept()
. -
Property:
Tokenizr.Token#text: String
The corresponding input text of this token. -
Property:
Tokenizr.Token#pos: Number
The (0-based) position in the input. -
Property:
Tokenizr.Token#line: Number
The (1-based) line number in the input. -
Property:
Tokenizr.Token#column: Number
The (1-based) column number in the input. -
Method:
Tokenizr.Token#toString(colorize?: (type: String, value: String) => String): String
Returns a formatted representation of the token, usually for debugging or tracing purposes only. The optionalcolorize
callback can be used to colorize the output. -
Method:
Tokenizr.Token#isA(type: String, value?: any): String
Checks whether token matches against a particulartype
and optionally a particularvalue
. This is especially used internally byTokenizr#consume()
.
This is the class of all thrown exceptions related to parsing.
-
Property:
Tokenizr.ParsingError#name: String
Always just the stringParsingError
to be complaint to the JavaScriptError
class specification. -
Property:
Tokenizr.ParsingError#message: String
The particular error message. -
Property:
Tokenizr.ParsingError#pos: Number
The (0-based) position in the input. -
Property:
Tokenizr.ParsingError#line: Number
The (1-based) line number in the input. -
Property:
Tokenizr.ParsingError#column: Number
The (1-based) column number in the input. -
Property:
Tokenizr.ParsingError#input: String
The total input itself. -
Method:
Tokenizr.ParsingError#toString(): String
Returns a formatted representation of the error, usually for convenient error displaying purposes.
This is the class of all rule action contexts.
-
Method:
Tokenizr.ActionContext#data(key: String, value?: any): any
Store or retrieve any user data (indexed bykey
) to the action context for sharing data between two or more rules. -
Method:
Tokenizr.ActionContext#info(): { line: number, column: number, pos: number, len: number }
Retrieve information about the current matching. -
Method:
Tokenizr.ActionContext#push(state: String): Tokenizr
Method:Tokenizr.ActionContext#pop(): String
Method:Tokenizr.ActionContext#state(state: String): Tokenizr.ActionContext
Method:Tokenizr.ActionContext#state(): String
Method:Tokenizr.ActionContext#tag(tag: String): Tokenizr.ActionContext
Method:Tokenizr.ActionContext#tagged(tag: String): Boolean
Method:Tokenizr.ActionContext#untag(tag: String): Tokenizr.ActionContext
Methods just passed-through to the attachedTokenizr
object. See above for details. -
Method:
Tokenizr.ActionContext#repeat(): Tokenizr.ActionContext
Mark the tokenization process to repeat the matching at the current input position from scratch. You first have to switch to a different state withTokenizr.ActionContext#state()
or this will lead to an endless loop, of course! -
Method:
Tokenizr.ActionContext#reject(): Tokenizr.ActionContext
Mark the current matching to be rejected. The tokenization process will continue matching following rules. -
Method:
Tokenizr.ActionContext#ignore(): Tokenizr.ActionContext
Mark the current matching to be just ignored. This is usually used for skipping whitespaces. -
Method:
Tokenizr.ActionContext#accept(type: String, value?: any): Tokenizr.ActionContext
Accept the current matching and produce a token oftype
and optionally with a differentvalue
(usually a pre-processed variant of the matched text). This function can be called multiple times to produce one or more distinct tokens in sequence. -
Method:
Tokenizr.ActionContext#stop(): Tokenizr.ActionContext
Immediately stop entire tokenization. After this theTokenizr#token()
method immediately starts to returnnull
.
The pattern
passed to Tokenizr.{before,after,rule}()
has to be a
regular JavaScript RegExp
objects. Internally, Tokenizr creates a copy
of this object by skipping its g
(global) and y
(sticky) flags and
taking over its m
(multiline), s
(dotAll), i
(ignoreCase), and u
(unicode) flags.
Although Tokenizr is written in ECMAScript 6, it is transpiled to ECMAScript 5 and this way runs in really all(!) current (as of 2015) JavaScript environments, of course.
Internally, Tokenizr scans the input string in a read-only fashion
by leveraging RegExp
's g
flag (global, for ECMAScript <= 5
environments) or y
flag (sticky, for ECMAScript >= 2015 environments)
in combination with RegExp
's lastIndex
field.
The following alternatives are known:
-
moo: A very powerful tokenizer/lexer. It provides nearly the same functionality than Tokenizr. In addition, it compiles all regular expressions into a single one and hence is one of the fastest tokenizers/lexers.
-
flex-js A medium powerful tokenizer/lexer. It provides nearly the same functionality than Tokenizr.
-
lex: A tokenizer/lexer, modeled after the popular C solution Flex. This small library is similar in spirit, but not as flexible as it does not provide state supports.
Copyright (c) 2015-2024 Dr. Ralf S. Engelschall (http://engelschall.com/)
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.