Implementing Lexer
The lexer, or lexical analyzer, defines how the contents of a file are broken into tokens. The lexer serves as a foundation for nearly all of the features of custom language plugins, from basic syntax highlighting to advanced code analysis features. The API for the lexer is defined by the Lexer interface.
The IDE invokes the lexer in three main contexts, and the plugin can provide different lexer implementations for these contexts:
- Syntax highlighting: the lexer is returned from the implementation of the SyntaxHighlighterFactory interface, which is registered in the com.intellij.lang.syntaxHighlighterFactory extension point.
- Building the syntax tree of a file: the lexer is expected to be returned from ParserDefinition.createLexer(), and the ParserDefinition interface is registered in the com.intellij.lang.parserDefinition extension point.
- Building the index of the words contained in the file: if the lexer-based words scanner implementation is used, the lexer is passed to the DefaultWordsScanner constructor.
The lexer used for syntax highlighting can be invoked incrementally to process only the changed part of a file, whereas lexers used in other contexts are always called to process an entire file, or a complete language construct embedded in a file in a different language.
A lexer that can be used incrementally may need to return its state, which means the context corresponding to each position in a file.
For example, a Java lexer could have separate states for top-level context, comment context and string literal context.
An important requirement for a syntax highlighting lexer is that its state must be represented by a single integer number returned from Lexer.getState().
That state will be passed to the Lexer.start() method, along with the start offset of the fragment to process, when lexing is resumed from the middle of a file.
Lexers used in other contexts can always return 0 from the getState() method.
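The role of the integer state can be sketched with a small stand-alone lexer. This is only an illustration: the class IncrementalLexerSketch, its two states, and the string token names are hypothetical and do not use the IntelliJ Platform classes; the method names are loosely modeled on the Lexer API. The sketch assumes that getState() reports the state in effect at the start of the current token, so lexing can be resumed at that token's start offset.

```java
import java.util.ArrayList;
import java.util.List;

// Toy stand-in for an incremental lexer (hypothetical, not the IntelliJ Platform Lexer).
final class IncrementalLexerSketch {
    static final int NORMAL = 0;      // top-level context
    static final int IN_COMMENT = 1;  // inside a /* ... */ comment that is still open

    private CharSequence buffer;
    private int bufferEnd;
    private int tokenStart;
    private int tokenEnd;
    private int tokenState;   // state in effect at tokenStart
    private int nextState;    // state in effect at tokenEnd
    private String tokenType; // null once the end of the range is reached

    // Takes the buffer, the range to lex, and the saved integer state.
    void start(CharSequence buffer, int startOffset, int endOffset, int initialState) {
        this.buffer = buffer;
        this.bufferEnd = endOffset;
        this.tokenEnd = startOffset;
        this.nextState = initialState;
        advance();
    }

    int getState() { return tokenState; }
    String getTokenType() { return tokenType; }
    int getTokenStart() { return tokenStart; }
    int getTokenEnd() { return tokenEnd; }

    void advance() {
        tokenStart = tokenEnd;
        tokenState = nextState;
        if (tokenStart >= bufferEnd) { tokenType = null; return; }
        int i = tokenStart;
        if (tokenState == IN_COMMENT) {
            // Comment body: one token per line, until the closing "*/".
            tokenType = "COMMENT";
            while (i < bufferEnd) {
                if (lookingAt(i, "*/")) { i += 2; nextState = NORMAL; break; }
                if (buffer.charAt(i++) == '\n') break;
            }
        } else if (lookingAt(i, "/*")) {
            // Comment opener switches the context for the following tokens.
            tokenType = "COMMENT";
            i += 2;
            nextState = IN_COMMENT;
        } else {
            // Anything else up to the end of the line or the next comment opener.
            tokenType = "CODE";
            while (i < bufferEnd && !lookingAt(i, "/*")) {
                if (buffer.charAt(i++) == '\n') break;
            }
        }
        tokenEnd = i;
    }

    private boolean lookingAt(int offset, String s) {
        return offset + s.length() <= bufferEnd
            && buffer.subSequence(offset, offset + s.length()).toString().equals(s);
    }

    // Lexes from startOffset to the end and renders each token as TYPE@start-end:state.
    static List<String> lex(CharSequence text, int startOffset, int initialState) {
        IncrementalLexerSketch lexer = new IncrementalLexerSketch();
        lexer.start(text, startOffset, text.length(), initialState);
        List<String> out = new ArrayList<>();
        while (lexer.getTokenType() != null) {
            out.add(lexer.getTokenType() + "@" + lexer.getTokenStart() + "-"
                + lexer.getTokenEnd() + ":" + lexer.getState());
            lexer.advance();
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "a\n/* x\ny */b";
        System.out.println(lex(text, 0, NORMAL));      // whole text
        System.out.println(lex(text, 7, IN_COMMENT));  // resumed inside the comment
    }
}
```

Resuming at offset 7 with the saved IN_COMMENT state yields the same trailing tokens as a full pass, whereas resuming at the same offset with NORMAL would misread the rest of the comment as code. That is the reason the state has to be stored alongside the offset.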
The easiest way to create a lexer for a custom language plugin is to use JFlex.
The adapter classes FlexLexer and FlexAdapter adapt JFlex lexers to the IntelliJ Platform Lexer API.
The source code of IntelliJ IDEA Community Edition includes a patched version of JFlex 1.4 located in tools/lexer/jflex-1.4 and the lexer skeleton file tools/lexer/idea-flex.skeleton, which can be used for creating lexers compatible with FlexAdapter.
The patched version of JFlex provides a new command line option, --charat, which changes the JFlex generated code so that it works with the IntelliJ Platform skeleton.
Enabling the --charat option passes the source data for lexing as a CharSequence and not as an array of characters.
For developing lexers using JFlex, the JFlex Support plugin can be useful. It provides syntax highlighting and other useful features for editing JFlex files. The GrammarKit plugin also has built-in JFlex support.
Note: Lexers, and in particular JFlex-based lexers, need to be created in such a way that they always match the entire contents of the file, without any gaps between tokens, and generate special tokens for characters which are not valid at their location. Lexers must never abort prematurely because of an invalid character.
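The requirement in the note can be illustrated with a hand-written toy lexer. This is a sketch under stated assumptions, not JFlex output: the class GaplessLexerSketch and the IDENTIFIER and NUMBER token names are hypothetical, while WHITE_SPACE and BAD_CHARACTER are named after the corresponding constants of the TokenType interface. Every character ends up in exactly one token, and an unexpected character becomes a one-character BAD_CHARACTER token instead of stopping the lexer.

```java
import java.util.ArrayList;
import java.util.List;

// Toy lexer that always covers the whole input (hypothetical, stand-alone).
final class GaplessLexerSketch {
    static final class Token {
        final String type;
        final int start;
        final int end;
        Token(String type, int start, int end) {
            this.type = type;
            this.start = start;
            this.end = end;
        }
        @Override public String toString() { return type + "[" + start + "," + end + ")"; }
    }

    static List<Token> tokenize(CharSequence text) {
        List<Token> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            char c = text.charAt(pos);
            int end = pos + 1;
            String type;
            if (Character.isWhitespace(c)) {
                type = "WHITE_SPACE";
                while (end < text.length() && Character.isWhitespace(text.charAt(end))) end++;
            } else if (Character.isLetter(c)) {
                type = "IDENTIFIER";
                while (end < text.length() && Character.isLetterOrDigit(text.charAt(end))) end++;
            } else if (Character.isDigit(c)) {
                type = "NUMBER";
                while (end < text.length() && Character.isDigit(text.charAt(end))) end++;
            } else {
                // Fallback rule: never abort, emit a token for the invalid character.
                type = "BAD_CHARACTER";
            }
            tokens.add(new Token(type, pos, end));
            pos = end;
        }
        return tokens;
    }

    // Checks that the tokens are non-empty, contiguous, and span [0, length).
    static boolean coversWithoutGaps(List<Token> tokens, int length) {
        int expected = 0;
        for (Token t : tokens) {
            if (t.start != expected || t.end <= t.start) return false;
            expected = t.end;
        }
        return expected == length;
    }

    public static void main(String[] args) {
        String input = "ab 12 # c";
        List<Token> tokens = tokenize(input);
        System.out.println(tokens);
        System.out.println(coversWithoutGaps(tokens, input.length()));
    }
}
```

In a JFlex specification the same effect is usually achieved with a catch-all rule at the end of the rule list, so that no input character is left unmatched.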
Example: Lexer definition for Properties language plugin
Types of tokens for lexers are defined by instances of IElementType. A number of token types common for all languages are defined in the TokenType interface. Custom language plugins should reuse these token types wherever applicable. For all other token types, the plugin needs to create new IElementType instances and associate them with the language in which the token type is used. The same IElementType instance should be returned every time a particular token type is encountered by the lexer.
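The "same instance every time" rule is commonly met by creating each token type once as a constant. The sketch below is only a stand-alone illustration: ElementType is a hypothetical stand-in for IElementType (a debug name plus the owning language, here a plain string), and the ToyProperties language with its KEY, SEPARATOR and VALUE types is made up for the example.

```java
// Stand-alone sketch of shared token type constants (hypothetical types).
final class TokenTypeSketch {
    // Hypothetical stand-in for IElementType.
    static final class ElementType {
        final String debugName;
        final String language;
        ElementType(String debugName, String language) {
            this.debugName = debugName;
            this.language = language;
        }
        @Override public String toString() { return language + ":" + debugName; }
    }

    // Each token type is created exactly once and shared by all users.
    static final ElementType KEY = new ElementType("KEY", "ToyProperties");
    static final ElementType SEPARATOR = new ElementType("SEPARATOR", "ToyProperties");
    static final ElementType VALUE = new ElementType("VALUE", "ToyProperties");

    // Classifies one piece of text; it returns a shared constant, never a new object.
    static ElementType typeOf(String tokenText, boolean afterSeparator) {
        if (tokenText.equals("=")) return SEPARATOR;
        return afterSeparator ? VALUE : KEY;
    }

    public static void main(String[] args) {
        // Identity comparison is safe because the instances are shared.
        System.out.println(typeOf("name", false) == KEY);
        System.out.println(typeOf("value", true));
    }
}
```

Because the lexer hands out the shared constants, code that consumes the tokens can compare token types by identity instead of by name.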
Example: Token types for Properties language plugin
An important feature which can be implemented at the lexer level is mixing languages within a file, for example, embedding fragments of Java code in some template language. If a language supports embedding its fragments in another language, it needs to define the chameleon token types for different types of fragments which can be embedded, and these token types need to implement the ILazyParseableElementType interface. The lexer of the enclosing language needs to return the entire fragment of the embedded language as a single chameleon token, of the type defined by the embedded language. To parse the contents of the chameleon token, the IDE will call the parser of the embedded language through a call to ILazyParseableElementType.parseContents().
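The chameleon idea can be sketched without the platform classes. Everything in this example is an assumption made for illustration: the `{{ ... }}` template syntax, the LazyType interface (a much simplified stand-in for ILazyParseableElementType, whose real parseContents() works on AST nodes rather than strings), and the word-splitting "parser" of the embedded language. The outer lexer does not look inside the fragment; it returns it as one token whose type belongs to the embedded language, and the contents are parsed only when requested.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Stand-alone sketch of a chameleon token (hypothetical types and syntax).
final class ChameleonSketch {
    // Hypothetical stand-in for a lazily parseable token type.
    interface LazyType {
        List<String> parseContents(CharSequence fragmentText);
    }

    // Defined by the embedded toy language: "parses" a fragment into words.
    static final LazyType EMBEDDED_EXPR = text -> {
        String trimmed = text.toString().trim();
        return trimmed.isEmpty()
            ? new ArrayList<String>()
            : Arrays.asList(trimmed.split("\\s+"));
    };

    static final class Token {
        final String kind;       // TEXT, OPEN, CLOSE or EMBEDDED
        final String text;
        final LazyType lazyType; // non-null only for the chameleon token
        Token(String kind, String text, LazyType lazyType) {
            this.kind = kind;
            this.text = text;
            this.lazyType = lazyType;
        }
    }

    // Lexer of the enclosing template language.
    static List<Token> lexTemplate(String input) {
        List<Token> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            int open = input.indexOf("{{", pos);
            if (open < 0) {
                tokens.add(new Token("TEXT", input.substring(pos), null));
                break;
            }
            if (open > pos) tokens.add(new Token("TEXT", input.substring(pos, open), null));
            tokens.add(new Token("OPEN", "{{", null));
            int close = input.indexOf("}}", open + 2);
            int end = close < 0 ? input.length() : close;
            // The whole embedded fragment becomes a single token.
            if (end > open + 2) {
                tokens.add(new Token("EMBEDDED", input.substring(open + 2, end), EMBEDDED_EXPR));
            }
            if (close < 0) break;
            tokens.add(new Token("CLOSE", "}}", null));
            pos = close + 2;
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (Token t : lexTemplate("Hi {{ a + b }}!")) {
            System.out.println(t.kind + " '" + t.text + "'");
            // The embedded language is consulted only for the chameleon token.
            if (t.lazyType != null) System.out.println("  parsed: " + t.lazyType.parseContents(t.text));
        }
    }
}
```

Keeping the fragment opaque to the outer lexer means the enclosing language needs no knowledge of the embedded grammar beyond where a fragment starts and ends.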