Maybe we need to parse all token by external scanner #54

wrvsrx · 2024-07-21T04:51:26Z

As we know, although djot is easy to parse by hand, it is highly context-sensitive, so parsing djot with tree-sitter is hard work. Thanks for your great work!

I notice that this tree-sitter grammar parses block-level tokens with an external scanner. However, it seems that inline-level tokens also need to be parsed by the external scanner. A lot of problems stem from this (#38).

To achieve that, we need to maintain two stacks, one at block level and one at inline level, and parse both block tokens and inline tokens in the external scanner.

At the end of a line, we need to determine how many blocks to close and which block to start by peeking at the next line.
At each block start, we need to determine whether to start parsing inline content or to start another block.
While parsing inline content, we need to push all possible inline openers onto the inline stack and turn them into actual tokens when we parse a potential closer.
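
The inline half of this proposal can be sketched roughly as follows. This is a simplified illustration in Python, not the grammar's actual scanner (which would be written in C). The names `DELIMS`, `Opener`, `has_content`, `merge_text` and `tokenize_inline` are hypothetical, only `*` and `_` are handled, and verbatim spans, links and braces are ignored. It also rewrites an already collected token list, which a real external scanner cannot do because it hands tokens to the parser one at a time.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical delimiter table; real djot has many more inline constructs.
DELIMS = {"*": "strong", "_": "emphasis"}


@dataclass
class Opener:
    char: str   # delimiter character of the potential opener
    index: int  # position of the opener in the token list


def has_content(tokens: list[tuple[str, str]], start: int, char: str) -> bool:
    """True if anything other than the bare delimiter follows `start`."""
    return any(tok != ("text", char) for tok in tokens[start:])


def merge_text(tokens: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Join adjacent text tokens so the result is easier to read."""
    merged: list[tuple[str, str]] = []
    for kind, value in tokens:
        if kind == "text" and merged and merged[-1][0] == "text":
            merged[-1] = ("text", merged[-1][1] + value)
        else:
            merged.append((kind, value))
    return merged


def tokenize_inline(text: str) -> list[tuple[str, str]]:
    tokens: list[tuple[str, str]] = []
    stack: list[Opener] = []  # the inline stack of unresolved openers
    for i, ch in enumerate(text):
        if ch not in DELIMS:
            tokens.append(("text", ch))
            continue
        # Simplified flanking rules: a closer must not follow whitespace,
        # an opener must not precede whitespace.
        can_close = i > 0 and not text[i - 1].isspace()
        can_open = i + 1 < len(text) and not text[i + 1].isspace()
        match = None
        if can_close:
            # Prefer the nearest opener of the same kind that has content.
            for depth in range(len(stack) - 1, -1, -1):
                opener = stack[depth]
                if opener.char == ch and has_content(tokens, opener.index + 1, ch):
                    match = depth
                    break
        if match is not None:
            kind = DELIMS[ch]
            # Turn the pending opener into an actual token and close it.
            tokens[stack[match].index] = (kind + "_begin", ch)
            tokens.append((kind + "_end", ch))
            del stack[match:]  # unresolved inner openers stay plain text
            continue
        if can_open:
            stack.append(Opener(ch, len(tokens)))
        tokens.append(("text", ch))  # plain text until proven otherwise
    return merge_text(tokens)


print(tokenize_inline("_a *b* c_"))
```

Openers that never find a closer simply remain text, which is why the stack only has to remember where each candidate sits.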

treeman · 2024-07-21T18:47:11Z

Yeah, I've reluctantly reached the same conclusion. When I started writing this grammar, naive as I was, I wanted to try to keep things as simple as possible but the various inline rules in grammar.js are already too complex, and I don't really see a way to solve the issues properly without handling them in an external scanner.

This will be a big rewrite and I think we need a look-ahead for the whole block. For example, a single _ shouldn't start an emphasis and we can only know if there's a second _ in the block if we parse until the end-of-block (paragraph end, div end, etc). It should be possible but there are quite a lot of tricky edge-cases here so it'll take some time to implement.

Still, this is the direction we need to go.

black-desk · 2024-08-03T06:14:08Z

But tree-sitter will serialize your scanner on every AST node even in the error AST. Looking ahead too much and keeping a large context might lead to a performance issue.

treeman · 2024-08-06T12:27:23Z

I admit that I'm out of my depth and I don't know how you're supposed to design a tree-sitter grammar.

My idea though wasn't to use error tokens at all and instead scan ahead to know what token to output.

For example if a * is found:

Scan the entire inline context.
If another * is found, we keep track of that and continue scanning.
At the end of the context, we can decide if the first * starts a strong element, or if that should be left to another shorter * pair that comes after.
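
The steps above could look roughly like this. It is a Python sketch under simplifying assumptions, not the scanner's real code: `star_opens_strong` and `has_content` are hypothetical names, only `*` is considered, and other inline constructs such as links and verbatim spans, which the real scan would have to skip over, are ignored.

```python
def has_content(text: str, opener: int, closer: int) -> bool:
    """True if something other than `*` lies between opener and closer."""
    return any(ch != "*" for ch in text[opener + 1:closer])


def star_opens_strong(text: str, pos: int) -> bool:
    """Scan the rest of the inline context to decide whether the `*` at
    `pos` starts a strong element."""
    if text[pos] != "*":
        raise ValueError("expected '*' at pos")
    if pos + 1 >= len(text) or text[pos + 1].isspace():
        return False  # an opener must not be followed by whitespace
    inner: list[int] = []  # later `*` openers that are still unmatched
    for i in range(pos + 1, len(text)):
        if text[i] != "*":
            continue
        can_close = not text[i - 1].isspace()
        can_open = i + 1 < len(text) and not text[i + 1].isspace()
        if can_close:
            # A shorter pair that comes later takes the closer first.
            matched = next(
                (k for k in range(len(inner) - 1, -1, -1)
                 if has_content(text, inner[k], i)),
                None,
            )
            if matched is not None:
                del inner[matched:]
                continue
            if has_content(text, pos, i):
                return True  # this closer belongs to the `*` at `pos`
        if can_open:
            inner.append(i)
    return False  # end of context reached without a closer for `pos`


line = "*not strong *strong*"
print(star_opens_strong(line, 0), star_opens_strong(line, 12))
```

Each call rescans from its own `*` to the end of the context, which is the repeated work mentioned in the next paragraph.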

If we implement this naively then we'll end up scanning the same inline context multiple times, but I hope we can cache that somehow.

This does mean we need to be able to scan the entire inline context (including links etc) so we can decide if we should count a * or not. Maybe this means completely forgoing any definitions in grammar.js... The unfortunate thing about an external scanner is that it sort of side-steps tree-sitter's backtracking algorithm, but I'm really not sure how that interaction works.

Maybe there's a better way to design this, I don't know.

black-desk · 2024-08-06T13:30:28Z

I’m not entirely sure what you’re referring to. Here, I’ll describe the potential issues I can think of more clearly:

Tree-sitter frequently serializes and deserializes the scanner during the use of the external scanner. You can see the specifics here: https://github.com/tree-sitter/tree-sitter/blob/7583d394b41a9ab26ae6d52d51c8965f627996cd/lib/src/parser.c#L531-L576

To achieve incremental parsing, Tree-sitter stores the scanner’s state every time the scanner produces a token.

This means we can’t actually store a lot of context information in the scanner; otherwise the space overhead during parsing would be significant.
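
To make the space concern concrete, here is a minimal Python model of the serialize/deserialize round trip. The real hooks are C functions that write into a fixed buffer whose size is `TREE_SITTER_SERIALIZATION_BUFFER_SIZE` (1024 bytes in tree-sitter's `parser.h`). The entry layout, `KINDS`, `serialize` and `deserialize` below are assumptions made for illustration only.

```python
import struct

# Mirrors TREE_SITTER_SERIALIZATION_BUFFER_SIZE from tree-sitter's parser.h.
BUFFER_SIZE = 1024

# Hypothetical element kinds kept on the inline stack.
KINDS = ["emphasis", "strong", "link"]


def serialize(stack: list[tuple[str, int]]) -> bytes:
    """Pack each (kind, counter) entry into two bytes."""
    out = bytearray()
    for kind, counter in stack:
        out += struct.pack("BB", KINDS.index(kind), counter)
    if len(out) > BUFFER_SIZE:
        raise ValueError("scanner state does not fit in the buffer")
    return bytes(out)


def deserialize(data: bytes) -> list[tuple[str, int]]:
    """Rebuild the stack from the packed bytes."""
    return [(KINDS[kind], counter)
            for kind, counter in struct.iter_unpack("BB", data)]


state = [("emphasis", 1), ("strong", 2)]
blob = serialize(state)
print(len(blob), deserialize(blob) == state)
```

With two bytes per entry, a shallow stack costs only a few bytes per stored state, whereas caching the result of a whole-block look-ahead would grow with the length of the block.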

Considering the amount of context information retained by the inline parser in the djot.js implementation (https://github.com/jgm/djot.js/blob/00d273e2c2d72b0eba7f463da0acf5ba2f9e4843/src/inline.ts#L527), I think looking ahead an entire block is unrealistic.

treeman · 2024-08-09T11:44:17Z

Yeah I agree that we need to hold the scanner context to a minimum.

I guess there's a trade-off here between repeatedly scanning the same information and storing it in the context. Scanning the entire inline context inside the external scanner doesn't seem to incur any serialization costs; serialization happens when the external scanner produces a token.

If we can't look ahead then I don't see how we can decide between a longer and a shorter emphasis element?

wrvsrx · 2024-08-19T10:15:12Z

tree-sitter-markdown uses two tree-sitter parsers, one for blocks and another for inline content. Can we use it as a reference?

treeman · 2024-08-20T04:01:52Z

We can't simply refer to it as the markdown inline rules are different from the djot inline rules.

Perhaps it's a good idea to split the parser into two, the same as markdown, though? It would make some things easier and maybe more performant, with the drawback of the user having to install two parsers instead of one.

treeman · 2024-08-26T08:29:41Z

I've tried a few things:

Separate the scanner into two grammars, like tree-sitter-markdown
Move some inline tokens to an external scanner

There are still a lot of things left to implement, but the initial feeling is good. I've been able to solve token precedence issues such as parsing *not strong *strong*, and with an external scanner we'll be able to ignore elements containing only delimiter tokens, such as ___.
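
As a small illustration of the delimiter-only case, a check along these lines could reject a candidate element such as ___. This is a hedged Python sketch: `is_delimiter_only` is a hypothetical helper, not a function from the grammar's scanner.

```python
def is_delimiter_only(span: str, delimiter: str) -> bool:
    """True if nothing but the delimiter sits between the opener and closer."""
    inner = span[1:-1]  # drop the opening and closing delimiter
    return all(ch == delimiter for ch in inner)


print(is_delimiter_only("___", "_"), is_delimiter_only("_a_", "_"))
```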

This will be a breaking change though and everyone would now need to update their existing queries and use two grammars.

wrvsrx · 2024-08-27T12:06:51Z

> I guess there's a trade-off here between repeatedly scanning the same information and storing it in the context. Scanning the entire inline context inside the external scanner doesn't seem to incur any serialization costs; serialization happens when the external scanner produces a token.

Maybe scanning the same information repeatedly is the more reasonable option? We just need to maintain a block stack and an inline stack. We only emit an inline token when we can determine what the next token is.

treeman · 2024-08-27T12:26:53Z

> Maybe scanning the same information repeatedly is the more reasonable option? We just need to maintain a block stack and an inline stack. We only emit an inline token when we can determine what the next token is.

Yeah, I have a prototype of a solution that I think should work. I'm only storing the inline stack (with some extra data on it) and we're scanning a little bit more (until the next ending element).

This shouldn't be a performance bottleneck: we unfortunately have to branch every time a beginning token is found, via a tree-sitter LR conflict, and that should be a lot more expensive than the small extra scan we're doing.

treeman · 2024-09-09T06:19:10Z

I've created a merge request from the split branch that implements the split parser.

I haven't created any "package files" in the root directory like there are in tree-sitter-markdown, as everything just lives in its own separate folder right now.

If anyone can test it out that would be fantastic.

treeman mentioned this issue Sep 9, 2024

Draft: Move inline tokens to external scanner and split the grammar into two pieces #55

Closed

treeman mentioned this issue Jan 28, 2025

Backport split parser changes #56

Merged

treeman closed this as completed Jan 28, 2025
