Expose a proper tokenization/colorization API to extensions #1967
I too support being able to replace the tokenizer implementation. For example, for a language I was implementing with first-class tooling in mind, I had to create an AST and its associated lexer/parser (flex/bison-based, incidentally). It's a shame that I can't reuse that lexer for syntax highlighting as well. Returning the token/position and saving the lexer state doesn't seem that difficult.
This part doesn't really make sense to me though.
I think they wouldn't be considered as tokenizers from VSCode's point of view. Expanding on @dajoh's example, you could have a tmLanguage tokenizer classifying the simple tokens quickly and have Roslyn parse and return detailed token information from another process. In this case VSCode might only need a way to change a token's style at any time.
@alexandrudima fyi
👍 This is a great request.
related to #580
I noticed that the TypeScript-tmLanguage project is not actively maintained. Can I assume that TypeScript will be one of the first beneficiaries of this set of APIs?
Just wondering if there's any update on this?
Sorry, nothing to report yet...
I don't suppose it's likely this will happen any time soon?
I hope this feature gets higher priority. It would make VS Code the best code editor for me.
Has there been any update on this?
Would love to see this added to allow syntax highlighting for PowerShell at the same level the ISE has! It would let me never touch the ISE again; I currently have to use it to sanity-check certain items that aren't highlighting as expected in Code.
I am very interested in a tokenization/highlighting API. I am developing a lexer generator with multiple output targets and would like to support vscode as a first-class output. I put serious effort into trying to generate regex definitions, but because the generated lexer is really a state machine it got very messy.

The key requirement for me is that the API allow the lexer to maintain state across lines of source code, e.g. a stack of contexts. This is necessary to support grammars that are not strictly regular, such as nested string interpolation syntaxes (for example, Swift lets you nest multiple levels of string interpolation, so the lexer needs a stack to switch between code and string lexing modes; multiline string literals require that this state persist across lines).

Another thing that would be great to see is documentation on Unicode correctness. For example, I assume the API would operate on JavaScript UCS-2 strings, so code points outside the BMP would be represented as surrogate pairs. Are these counted as 1 or 2 columns by the highlighting engine? This is also important for problem matchers. This stuff gets hard (e.g. deciding how wide a character will render is involved), so I wouldn't expect it to be perfect at first, but it's worth keeping these challenges in mind during the design phase.
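To illustrate the state-across-lines requirement, here is a minimal sketch in plain TypeScript. All names (`LexState`, `tokenizeLine`, the token kinds) are invented for illustration and are not part of any real VS Code API; the example uses nestable block comments as a stand-in for any construct whose lexing mode must survive line breaks:

```typescript
// Sketch: a per-line tokenizer whose state (here, block-comment nesting
// depth) is returned at the end of each line and fed into the next one.
// Names are illustrative, not part of any VS Code API.

interface LexState { commentDepth: number }

interface Token { start: number; end: number; kind: "comment" | "code" }

function tokenizeLine(
  line: string,
  state: LexState
): { tokens: Token[]; endState: LexState } {
  const tokens: Token[] = [];
  let depth = state.commentDepth;
  let i = 0;
  let runStart = 0;
  const flush = (end: number, kind: "comment" | "code") => {
    if (end > runStart) tokens.push({ start: runStart, end, kind });
    runStart = end;
  };
  while (i < line.length) {
    if (line.startsWith("/*", i)) {
      if (depth === 0) flush(i, "code"); // close the preceding code run
      depth++;
      i += 2;
    } else if (depth > 0 && line.startsWith("*/", i)) {
      depth--;
      i += 2;
      if (depth === 0) flush(i, "comment"); // comment fully closed
    } else {
      i++;
    }
  }
  flush(line.length, depth > 0 ? "comment" : "code");
  return { tokens, endState: { commentDepth: depth } };
}
```

Feeding the `endState` of one line into the next is exactly what lets a comment (or a string-interpolation stack) opened on one line continue onto the following lines.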
The Microsoft C++ extension is also very interested in this. At the very least, we would like a way to colorize sections of code to mark them as inactive based on #ifdef/#else/#endif/etc. sections. It's something that Visual Studio can do, but unfortunately we can't do this with TextMate grammars, since the tokens need to be evaluated by the compiler, not by regular expressions.
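The inactive-region idea can be sketched independently of any editor API. The following is an illustrative TypeScript toy, not the C/C++ extension's actual logic: given the set of defined macros, it computes line ranges a client could ask the editor to dim, handling only flat (non-nested) `#ifdef`/`#else`/`#endif`:

```typescript
// Toy sketch (not the real C/C++ extension implementation): compute
// [startLine, endLine] ranges that are preprocessed away and could be
// rendered as "inactive". Only simple, non-nested directives are handled.

function inactiveRegions(
  lines: string[],
  defined: Set<string>
): Array<[number, number]> {
  const regions: Array<[number, number]> = [];
  let active = true;    // is the current branch compiled in?
  let runStart = -1;    // first line of the current inactive run, or -1
  const close = (endLine: number) => {
    if (runStart >= 0) {
      regions.push([runStart, endLine]);
      runStart = -1;
    }
  };
  lines.forEach((line, n) => {
    const m = line.match(/^\s*#\s*(ifdef|else|endif)\s*(\w*)/);
    if (!m) {
      if (!active && runStart < 0) runStart = n;
      return;
    }
    if (m[1] === "ifdef") {
      active = defined.has(m[2]);
    } else if (m[1] === "else") {
      close(n - 1);
      active = !active;
    } else { // endif
      close(n - 1);
      active = true;
    }
  });
  return regions;
}
```

The point of the sketch is the output shape: the hard part (evaluating which branch is live) belongs to the compiler, and all the editor would need is an API accepting ranges plus a style to apply.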
Actually a dupe of #585
Rich context-sensitive syntax colorization is very hard to do (if not impossible) with tmLanguage syntax definitions. The functionality for specifying custom colorizers seems to be there, but not exposed to extensions (ITokenizationSupport).
One way to expose colorization would be to just let the extension provide an ITokenizationSupport implementation, and have that completely override the tmLanguage syntax definition (if any).
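What such an override could look like, sketched with invented interface names (the real internal ITokenizationSupport shape may differ; this is only meant to show the contract an extension would implement):

```typescript
// Hypothetical shapes only: these interfaces are invented to illustrate the
// proposal and are not copied from VS Code's internal ITokenizationSupport.

interface TokenizerState {
  clone(): TokenizerState;
  equals(other: TokenizerState): boolean;
}

interface LineTokens {
  tokens: Array<{ startIndex: number; scope: string }>;
  endState: TokenizerState;
}

interface TokenizationSupport {
  getInitialState(): TokenizerState;
  tokenizeLine(line: string, state: TokenizerState): LineTokens;
}

// A trivial stateless state object for demonstration.
class TrivialState implements TokenizerState {
  clone(): TokenizerState { return this; }
  equals(other: TokenizerState): boolean { return other === this; }
}

// A toy implementation: word runs become "identifier", everything else
// becomes "delimiter". A real extension would plug in its own lexer here,
// fully replacing any tmLanguage definition for the language.
class WordTokenizer implements TokenizationSupport {
  getInitialState(): TokenizerState { return new TrivialState(); }
  tokenizeLine(line: string, state: TokenizerState): LineTokens {
    const tokens: Array<{ startIndex: number; scope: string }> = [];
    let i = 0;
    while (i < line.length) {
      const start = i;
      const isWord = /\w/.test(line[i]);
      while (i < line.length && /\w/.test(line[i]) === isWord) i++;
      tokens.push({ startIndex: start, scope: isWord ? "identifier" : "delimiter" });
    }
    return { tokens, endState: state };
  }
}
```

Because `tokenizeLine` takes and returns a state object, the same contract covers both trivial tokenizers like this one and stateful lexers that need to carry context across lines.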
Another way is to let multiple tokenizers work in parallel, each classifying different tokens of the program. For example: a tmLanguage-based tokenizer classifies easy tokens such as keywords, strings, and literals, while a custom tokenizer classifies tokens that generally need context information, such as identifiers (think type names). The reason for wanting this is that classifying tokens such as identifiers is generally much slower than classifying keywords; separating the tokenizers allows instant colorization of the easy-to-classify tokens, while harder tokens like identifiers are classified in the background and colorized when ready. It might make sense to allow only one tokenizer per language, but multiple (potentially async) token classifiers.
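The fast-tokenizer-plus-async-classifier split described above can be sketched as follows. All names are invented, and `slowClassify` is just a stand-in for a compiler service (such as Roslyn in the earlier example) answering out of process:

```typescript
// Sketch of "one fast tokenizer, many async classifiers" (invented names).
// The fast pass returns immediately; slower semantic classification arrives
// later and refines specific tokens.

type Classification = { start: number; length: number; kind: string };

// Fast, synchronous pass: a regex scan for a few keywords, analogous to
// what a tmLanguage grammar can do cheaply.
function fastTokenize(line: string): Classification[] {
  const out: Classification[] = [];
  const re = /\b(class|return|if)\b/g;
  let m: RegExpExecArray | null;
  while ((m = re.exec(line)) !== null) {
    out.push({ start: m.index, length: m[0].length, kind: "keyword" });
  }
  return out;
}

// Slow, asynchronous pass: a stand-in for a real semantic engine that
// knows which identifiers are type names.
async function slowClassify(
  line: string,
  knownTypes: Set<string>
): Promise<Classification[]> {
  await new Promise((r) => setTimeout(r, 0)); // simulate out-of-process latency
  const out: Classification[] = [];
  const re = /\b[A-Za-z_]\w*\b/g;
  let m: RegExpExecArray | null;
  while ((m = re.exec(line)) !== null) {
    if (knownTypes.has(m[0])) {
      out.push({ start: m.index, length: m[0].length, kind: "type" });
    }
  }
  return out;
}
```

An editor following this design would paint the `fastTokenize` result on the next frame and then re-paint the affected ranges whenever a `slowClassify` promise resolves, which is exactly why the two passes should be separate providers rather than one tokenizer.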