Work in lexer-parser feature. #85

Closed
UzielSilva opened this issue Sep 3, 2014 · 16 comments

Comments

@UzielSilva
Contributor

Hi Tomas,
I will work on a syntax highlighting feature via a lexer-parser (using ANTLR v4 and https://github.com/antlr/grammars-v4).
Would it be a problem if I create a pull request for this once it is done?

@TomasMikula
Member

Hi Uziel,

I've been wanting to have it for a long time! I would rather keep RichTextFX low-level and versatile, not bound to any specific parser implementation. I would encourage you to start a separate project that depends on both RichTextFX and ANTLR (and the grammars) and provides syntax highlighting out of the box. I am myself thinking of starting such a project using the Papa Carlo incremental parser when I find some time. Nevertheless, I highly support your ANTLR effort, since there are ANTLR grammars available for many languages, while Papa Carlo is still experimental and I have only seen JSON and Java grammars for it. I will provide a link to your project on RichTextFX page if you decide to go this way.

Best,
Tomas

@UzielSilva
Contributor Author

Papa Carlo sounds amazing! Now I want to learn about it.
I'll try to work with that tool and look into how to adapt existing ANTLR parsers.
If I can't learn it, I'll work with ANTLR.

@jeffreyguenther
Contributor

Hi Uziel and Tomas,

For my thesis, I need to add syntax highlighting to an editor based on the language I'm developing. Yesterday, I whipped up a prototype based on Tomas' JavaKeywordAsync example that might help this conversation. You can see it here.

My basic approach is to use an ANTLR4 lexer to generate tokens for the editor's text. I map each token's type to a CSS class to be applied to my code area instance. ANTLR's Token provides the start and end position of the token in the text. I use these positions to create the StyleSpans.

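    private static StyleSpans<Collection<String>> computeHighlighting(String text){
        StyleSpansBuilder<Collection<String>> spansBuilder = new StyleSpansBuilder<>();
        int lastEnd = 0; // offset just past the previous token

        // Lex the whole text; characters the lexer skips show up as gaps between tokens.
        ShiroLexer lex = new ShiroLexer(new ANTLRInputStream(text));

        for(Token t: lex.getAllTokens()){
            // Unstyled span covering the gap since the previous token.
            spansBuilder.add(Collections.emptyList(), t.getStartIndex() - lastEnd);
            // Styled span covering the token; getStopIndex() is inclusive, hence the + 1.
            spansBuilder.add(Collections.singleton(getStyleClass(t)),
                  (t.getStopIndex() + 1 - t.getStartIndex()));
            lastEnd = t.getStopIndex() + 1;
        }

        // Unstyled span for any text after the last token.
        spansBuilder.add(Collections.emptyList(), text.length() - lastEnd);

        return spansBuilder.create();
    }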

With regard to supporting a variety of languages in RichTextFX, it shouldn't be too difficult. I haven't had the time to write up the code yet, but this would be my approach: I would experiment with the parser and lexer interpreters provided by the ANTLR4 runtime. These allow you to load a grammar (combined, parser, or lexer) from a file at runtime. The challenge is how to turn the parse tree or the token stream into StyleSpans. You need to know something about the token types or the grammar before you write the code that maps token types to CSS classes; otherwise you won't know how to do the mapping.

Now that I think of it, this probably means you'd be better off building something like Pygments based on ANTLR4. You would collect all the grammars you want to support (ANTLR has a repo of different grammars), generate parsers and lexers for them, and then write a class to convert those token streams or parse trees (whichever you choose) to StyleSpans. You could generate style mappings that allow you to reuse Pygments stylesheets. In fact, it shouldn't be too much work to write a script or program that converts a Pygments stylesheet to a JavaFX stylesheet.
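As a rough illustration of the stylesheet conversion mentioned above, the sketch below rewrites the declarations of a single CSS rule of the kind Pygments emits. The class name `PygmentsToJavaFx` and the three-entry property table are assumptions made for this example, not part of Pygments, JavaFX, or RichTextFX; the table assumes the styled nodes are JavaFX `Text` nodes, where the text color is set with `-fx-fill`. Properties with no listed equivalent are simply dropped.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: rewrites one web CSS rule into JavaFX CSS. */
final class PygmentsToJavaFx {

    // Assumed mapping from web CSS properties to JavaFX CSS properties of a Text node.
    private static final Map<String, String> PROPERTY_MAP = new LinkedHashMap<>();
    static {
        PROPERTY_MAP.put("color", "-fx-fill");
        PROPERTY_MAP.put("font-weight", "-fx-font-weight");
        PROPERTY_MAP.put("font-style", "-fx-font-style");
    }

    /** Converts a rule such as ".k { color: #008000; font-weight: bold }". */
    static String convertRule(String rule) {
        int open = rule.indexOf('{');
        int close = rule.lastIndexOf('}');
        if (open < 0 || close < open) {
            throw new IllegalArgumentException("Not a CSS rule: " + rule);
        }
        String selector = rule.substring(0, open).trim();
        StringBuilder out = new StringBuilder(selector).append(" {");
        for (String declaration : rule.substring(open + 1, close).split(";")) {
            int colon = declaration.indexOf(':');
            if (colon < 0) {
                continue; // blank or malformed declaration
            }
            String property = declaration.substring(0, colon).trim();
            String value = declaration.substring(colon + 1).trim();
            String fxProperty = PROPERTY_MAP.get(property);
            if (fxProperty != null) { // drop properties with no known equivalent
                out.append(' ').append(fxProperty).append(": ").append(value).append(';');
            }
        }
        return out.append(" }").toString();
    }
}
```

For example, `PygmentsToJavaFx.convertRule(".k { color: #008000; font-weight: bold }")` returns `.k { -fx-fill: #008000; -fx-font-weight: bold; }`. A real converter would also need to handle comments, grouped selectors, and background colors.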

Regarding an incremental parser, I've only started to think about how I would make syntax highlighting fast and efficient. I found this post by one of the ANTLR developers useful.

@jeffreyguenther
Contributor

Quick update:

Here is an ANTLR4 parser-based implementation of a syntax highlighter for my language. @UzielSilva, you should be able to use the same technique to adapt it to a language of your choice.

I'm starting on a more generic version that will allow you to choose between a number of different languages using the techniques described in my previous post. Watch Xanthic for progress.

@TomasMikula
Member

Hi Jeff,

nice work!

and then write a class to convert those token streams or parse trees (whichever you choose) to StyleSpans.

Does this mean that you need to write a class manually for each of the supported languages? Wouldn't it be possible to require just a mapping (e.g. Map<String, String>) from token types to Pygments styles to be specified manually for each language?

@jeffreyguenther
Contributor

Hi Tomas,

Thanks!

If you only want to use a lexer, you're right. You would just need a mapping between token types and styles to generate the StyleSpans.
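A minimal sketch of that lexer-only case follows. To stay self-contained it does not use ANTLR or RichTextFX: `LexToken` stands in for a lexer token (type name plus inclusive start and stop offsets, as with ANTLR's `Token`), and `StyledRun` stands in for one entry of a `StyleSpans`. Both classes, and `LexerOnlyHighlighter` itself, are hypothetical names introduced for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Stand-in for a lexer token. */
final class LexToken {
    final String typeName;
    final int start;
    final int stop; // inclusive

    LexToken(String typeName, int start, int stop) {
        this.typeName = typeName;
        this.start = start;
        this.stop = stop;
    }
}

/** Stand-in for one style span: a style class ("" means unstyled) and a length. */
final class StyledRun {
    final String styleClass;
    final int length;

    StyledRun(String styleClass, int length) {
        this.styleClass = styleClass;
        this.length = length;
    }
}

final class LexerOnlyHighlighter {
    /** The per-language part is only the styleMap from token type names to style classes. */
    static List<StyledRun> highlight(String text, List<LexToken> tokens,
                                     Map<String, String> styleMap) {
        List<StyledRun> runs = new ArrayList<>();
        int lastEnd = 0;
        for (LexToken t : tokens) {
            if (t.start > lastEnd) {
                runs.add(new StyledRun("", t.start - lastEnd)); // gap, e.g. whitespace
            }
            // Token types missing from the map stay unstyled.
            runs.add(new StyledRun(styleMap.getOrDefault(t.typeName, ""), t.stop + 1 - t.start));
            lastEnd = t.stop + 1;
        }
        if (text.length() > lastEnd) {
            runs.add(new StyledRun("", text.length() - lastEnd)); // trailing text
        }
        return runs;
    }
}
```

Supporting another language with this shape would only require a different token list and a different `styleMap`; the loop itself stays the same.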

If you want the grammar to determine the highlighting, you'll need to walk the parse tree and decide in each rule which style to assign to the token. I think in many cases using just a lexer will be enough; however, if you want to assign the same token different styles depending on its grammatical function, you'll need to write either an ANTLR visitor or a tree listener.

For example, here are two ANTLR parser rules and the lexer rule for the token they share:

nodestmt
    :   NODE MFNAME ('[' activeSelector ']')? BEGIN NEWLINE
        nodeInternal
        END
    ;

portDecl
    :   portType portName MFNAME
    ;

MFNAME: UCLETTER (LCLETTER | UCLETTER | DIGIT | '_')*;

If I want to style the MFNAME token in the nodestmt rule differently from the one in the portDecl rule, I'll need to walk the tree to associate different styles with these tokens. This allows you to do grammar-sensitive highlighting.
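The idea can be sketched without ANTLR by mimicking the shape of a tree listener: callbacks fire when a rule is entered or exited and when a token is visited, and the listener keeps a stack of the enclosing rules. `ContextStyler`, its method names, and the style names `node-name` and `port-function` are made up for this illustration; a real implementation would extend the listener class ANTLR generates for the grammar.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical listener-style styler: picks a style from the token type and its enclosing rule. */
final class ContextStyler {
    private final Deque<String> ruleStack = new ArrayDeque<>();
    private final List<String> styles = new ArrayList<>();

    void enterRule(String ruleName) {
        ruleStack.push(ruleName);
    }

    void exitRule() {
        ruleStack.pop();
    }

    /** Records one style per visited token, in visiting order. */
    void visitToken(String tokenType) {
        String enclosing = ruleStack.isEmpty() ? "" : ruleStack.peek();
        if (tokenType.equals("MFNAME") && enclosing.equals("nodestmt")) {
            styles.add("node-name");
        } else if (tokenType.equals("MFNAME") && enclosing.equals("portDecl")) {
            styles.add("port-function");
        } else {
            styles.add("plain");
        }
    }

    List<String> styles() {
        return styles;
    }
}
```

Driving it by hand with `enterRule("nodestmt")`, `visitToken("MFNAME")`, `enterRule("portDecl")`, `visitToken("MFNAME")` yields `node-name` for the first MFNAME and `port-function` for the second, even though the lexer reports the same token type for both.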

My thinking with Xanthic is to build a pipeline like:

  1. Lex using an ANTLR lexer
  2. Pass the result into a Highlighter class that does the conversion. This would hide whether the style conversion happens by iterating through a token list or by walking a parse tree. The result would be a generic internal format that can be used by formatters.
  3. The Formatter would consume this format and generate the appropriate output (StyleSpans, HTML, PostScript, LaTeX, etc.)
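One possible shape for this pipeline is sketched below. All names here (`StyledText`, `Highlighter`, `OutputFormatter`, `HtmlFormatter`, `Pipeline`) are assumptions for illustration and are not taken from Xanthic. Lexing is folded into the `Highlighter` so that callers cannot tell whether a token list or a parse tree produced the styles.

```java
import java.util.List;

/** Generic intermediate format: a piece of source text and the style chosen for it. */
final class StyledText {
    final String text;
    final String style; // "" means unstyled

    StyledText(String text, String style) {
        this.text = text;
        this.style = style;
    }
}

/** Steps 1 and 2: lexing (and possibly parsing) hidden behind one method. */
interface Highlighter {
    List<StyledText> highlight(String source);
}

/** Step 3: renders the intermediate format into some output type T. */
interface OutputFormatter<T> {
    T format(List<StyledText> pieces);
}

/** Example formatter that emits HTML spans. */
final class HtmlFormatter implements OutputFormatter<String> {
    @Override
    public String format(List<StyledText> pieces) {
        StringBuilder html = new StringBuilder();
        for (StyledText piece : pieces) {
            String escaped = piece.text.replace("&", "&amp;")
                                       .replace("<", "&lt;")
                                       .replace(">", "&gt;");
            if (piece.style.isEmpty()) {
                html.append(escaped);
            } else {
                html.append("<span class=\"").append(piece.style).append("\">")
                    .append(escaped).append("</span>");
            }
        }
        return html.toString();
    }
}

final class Pipeline {
    static <T> T run(String source, Highlighter highlighter, OutputFormatter<T> formatter) {
        return formatter.format(highlighter.highlight(source));
    }
}
```

A formatter targeting RichTextFX would implement `OutputFormatter<StyleSpans<Collection<String>>>` in the same way, so adding an output format would not touch the highlighting code.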

@TomasMikula
Member

Thanks for the clarification.

Still, isn't there a way to walk the parse tree in a generic way and choose styles based on the names of the grammar rules? The mapping from tokens to styles could then be specified as

"nodestmt/MFNAME" -> some_style
"portDecl/MFNAME" -> another_style

or something like that.
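A generic walk of this kind might look like the following sketch. `TreeNode` is a stand-in for a parse-tree node and `PathStyler` is a hypothetical name; the lookup tries the `parentRule/TOKEN` key first and falls back to the bare token name, which is one possible resolution order rather than anything prescribed in this thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

/** Stand-in for a parse-tree node: a rule or token name plus children (none for tokens). */
final class TreeNode {
    final String name;
    final List<TreeNode> children;

    TreeNode(String name, TreeNode... children) {
        this.name = name;
        this.children = Arrays.asList(children);
    }
}

final class PathStyler {
    /** Returns one style per leaf, in document order; "" means unstyled. */
    static List<String> styleLeaves(TreeNode root, Map<String, String> styleMap) {
        List<String> styles = new ArrayList<>();
        walk(root, "", styleMap, styles);
        return styles;
    }

    private static void walk(TreeNode node, String parentRule,
                             Map<String, String> styleMap, List<String> styles) {
        if (node.children.isEmpty()) {
            // Simplification: any childless node is treated as a token.
            String style = styleMap.get(parentRule + "/" + node.name);
            if (style == null) {
                style = styleMap.getOrDefault(node.name, ""); // fall back to the bare token name
            }
            styles.add(style);
            return;
        }
        for (TreeNode child : node.children) {
            walk(child, node.name, styleMap, styles);
        }
    }
}
```

With this shape, the per-language input is again just a map, and no language-specific class has to be written by hand.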

On the other hand, even grammar-sensitive highlighting doesn't give you all you might want, anyway. For example, one may want different styles for an IDENTIFIER in Java, depending on whether it is a type name, static field, non-static field, final/non-final, local variable, method name, ... You may recognize some of them (e.g. if it's in a type position, then it's a type name), but not all of them, at the grammar level. You would need type-checking for that, but this is how far you get with a syntax-highlighting editor.

@jeffreyguenther
Contributor

ANTLR does support a subset of XPath for identifying parse tree nodes, but I haven't played with it yet. It might be possible to do what you describe using XPath.

Can you walk me through some of the API use cases you have in mind? How would you like syntax highlighting to work?

You would need type-checking for that, but this is how far you get with a syntax-highlighting editor.

Yes, you're right. The more information about the language you want to use to inform style decisions, the more you'll inch towards writing a sort of interpreter to do the highlighting. With ANTLR, writing interpreters of this style is pretty easy.

@TomasMikula
Member

At the high level, I imagine something like this:

class SyntaxArea extends CodeArea {
    public void setSyntax(Syntax syntax);
}

class Syntax {
    public Syntax(Grammar grammar, Map<String, String> styleMap);
}

styleMap is a mapping from XPath expressions to pygments style classes.

I'm not sure what Grammar should be, though. Does ANTLR have a class for representing grammars? Is ANTLR able to generate a parser at runtime, given the grammar?

@jeffreyguenther
Contributor

Yes, ANTLR can interpret grammars at runtime. Grammar would be ANTLR's Grammar.

This would allow someone to load a grammar at runtime from a file. The only drawback of this approach is not being able to walk the parse tree, but I don't think that will be an issue because you'd be using XPaths to get the parse tree nodes rather than walking the tree.

@TomasMikula
Member

Looks good to me!

@jeffreyguenther
Contributor

Ok, give me a couple days to get a prototype built. One more question. Is this something you want in the RichTextFX repo, or would you rather I built it as a separate library people can add if they want it? In other words, do you want to add ANTLR as a dependency to RichTextFX?

@TomasMikula
Member

As I mentioned in my comment to Uziel above, I would rather keep it a separate project, for two reasons:

  1. I would like to give users freedom to use their parser of choice.
  2. The dependency on ANTLR is unnecessary for some applications, e.g. rich text applications.

@jeffreyguenther
Contributor

Sounds good. Just wanted to double check. I'll post here when I have something to share.

@jeffreyguenther
Contributor

The code is a bit rough at the moment, but I built a proof of concept. You can find it at https://github.com/jrguenther/Xanthic. Clone the repo and run gradle run. For feature requests and bug reports, let's use that repo's issue tracker.

Enjoy!

@TomasMikula
Member

Good job, keep it up!
