-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Maybe we need to parse all token by external scanner #54
Comments
Yeah, I've reluctantly reached the same conclusion. When I started writing this grammar, naive as I was, I wanted to try to keep things as simple as possible but the various inline rules in This will be a big rewrite and I think we need a look-ahead for the whole block. For example, a single Still, this is the direction we need to go. |
But tree-sitter will serialize your scanner on every AST node even in the error AST. Looking ahead too much and keeping a large context might lead to a performance issue. |
I admit that I'm out of my depth and I don't know how you're supposed to design a tree-sitter grammar. My idea though wasn't to use error tokens at all and instead scan ahead to know what token to output. For example if a
If we implement this naively then we'll end up scanning the same inline context multiple times, but I hope we can cache that somehow. This does mean we need to be able to scan the entire inline context (including links etc) so we can decide if we should count a Maybe there's a better way to design this, I don't know. |
I’m not entirely sure what you’re referring to. Here, I’ll describe the potential issues I can think of more clearly: Tree-sitter frequently serializes and deserializes the scanner during the use of the external scanner. You can see the specifics here: https://github.com/tree-sitter/tree-sitter/blob/7583d394b41a9ab26ae6d52d51c8965f627996cd/lib/src/parser.c#L531-L576 To achieve incremental parsing, Tree-sitter stores the scanner’s state every time the scanner produces a token. This means we can’t actually store a lot of context information in the scanner, otherwise, the space overhead during the parsing process would be significant. Considering the amount of context information retained by the inline parser in the djot.js implementation (https://github.com/jgm/djot.js/blob/00d273e2c2d72b0eba7f463da0acf5ba2f9e4843/src/inline.ts#L527), I think looking ahead an entire block is unrealistic. |
Yeah I agree that we need to hold the scanner context to a minimum. I guess there's a trade-off here in repeatedly scanning the same information and storing it in the context. Scanning the entire inline context inside the external scanner doesn't seem to incur any serialization costs, that happens when the external scanner produces a token. If we can't look ahead then I don't see how we can decide between a longer and a shorter emphasis element? |
tree-sitter-markdown use two level tree-sitter-parser, one for block and another for inline. Can we refer to it? |
We can't simply refer to it as the markdown inline rules are different from the djot inline rules. Perhaps its a good idea to split the parser into two, the same as markdown though? It would make some things easier and maybe more performant, with the drawback of the user having to install two parsers instead of one. |
I've tried a few things:
There's still a lot of things left to implement, but the initial feeling is good. I've been able to solve token precedence issues such as parsing This will be a breaking change though and everyone would now need to update their existing queries and use two grammars. |
Maybe it's a more reasonable option to scanning the same information repeatedly? We just need to maintain a block stack and a inline stack. We only emit inline token when we can determine what the next token is. |
Yeah, I think I have a prototype of a solution I think should work. I'm only storing the inline stack (with some extra data on it) and we're scanning a little bit more (until the next ending element). This shouldn't be a performance bottleneck as we unfortunately have to branch out every time a beginning token is found via the treesitter LR conflict that should be a lot more expensive than the small extra scan we're doing. |
I've created a merge request from the I haven't created any "package files" in the root directory like there is in tree-sitter-markdown as everything just lives in their separate folders right now. If anyone can test it out that would be fantastic. |
As we know, although it's easy to parse djot by hand, djot is highly context sensitive, so parsing djot using tree-sitter is a hardwork. Thanks for your great work!
I notice this tree-sitter parses block level token by external scanner. However, it seems that inline level token also need to be parsed using external scanner. A lot of problems come from that (#38).
To achive that, we need to maintain two stacks, one block level and one inline level and parse block token and inline token using external scanner.
The text was updated successfully, but these errors were encountered: