Draft: Move inline tokens to external scanner and split the grammar into two pieces #55

treeman · 2024-09-09T06:16:03Z

This is a major rewrite containing two large changes and many smaller improvements.

The biggest change is that it splits the parser in two (like tree-sitter-markdown), so two parsers must be used to parse a Djot file. This is a big breaking change: many capture groups have also changed, so all query files need to be updated.

The other big change is to let the external parser collect a stack of inline elements, solving a bunch of issues from the Djot spec (such as parsing *not strong *strong* properly).
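To illustrate the stack idea, here is a minimal sketch in C. It is not the scanner code from this PR: the names `StrongSpan`, `find_strong_spans`, and `MAX_OPEN` are hypothetical, and it only handles `*` with simplified open/close rules (an opener must be followed by non-whitespace, a closer must be preceded by non-whitespace). A closer matches the most recent opener on the stack, and openers that never get matched stay literal text.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_OPEN 32 /* hypothetical limit on simultaneously open delimiters */

/* Hypothetical result type: byte offsets of one matched pair of `*`. */
typedef struct {
    size_t open;
    size_t close;
} StrongSpan;

/* Scan `text` and record matched `*...*` pairs in `out` (at most `max_out`).
 * Returns the number of pairs recorded. Simplified rules, not the full spec. */
static size_t find_strong_spans(const char *text, StrongSpan *out, size_t max_out) {
    assert(text != NULL);

    size_t stack[MAX_OPEN]; /* offsets of `*` openers that are still unmatched */
    size_t depth = 0;
    size_t found = 0;
    size_t len = strlen(text);

    for (size_t i = 0; i < len; i++) {
        if (text[i] != '*') {
            continue;
        }

        bool can_close = i > 0 && !isspace((unsigned char)text[i - 1]);
        bool can_open = i + 1 < len && !isspace((unsigned char)text[i + 1]);

        if (can_close && depth > 0 && i > stack[depth - 1] + 1) {
            /* Close the most recent opener; older openers stay on the stack. */
            if (found < max_out) {
                out[found].open = stack[depth - 1];
                out[found].close = i;
                found++;
            }
            depth--;
        } else if (can_open && depth < MAX_OPEN) {
            stack[depth++] = i;
        }
        /* Otherwise the `*` is plain text. */
    }

    /* Anything still on the stack never found a closer and remains literal. */
    return found;
}
```

With the input `*not strong *strong*`, the first `*` is pushed but never closed, so only the second pair is reported as strong.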

In the process I also fixed a bunch of other issues.

Closes #41, #42, #43, #44, #45, #46, #49, #50, #52, #53, #54

clason · 2024-10-22T08:09:02Z

@treeman Can you explain what happened here? It looks like the master branch now has the (WIP?) split parser, which broke nvim-treesitter, but this PR was not merged.

clason · 2024-10-22T08:10:44Z

And while I'm here: tree-sitter build complains about

Warning: Found non-static non-tree-sitter functions in the external scannner
  `_init`
  `_set_delayed_token`
Consider making these functions static, they can cause conflicts when another tree-sitter project uses the same function name
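For reference, a rough sketch of what the fix could look like. The `Scanner` struct and the function bodies are assumptions for illustration, not the actual scanner code; the only point is the `static` keyword on the helpers. I also dropped the leading underscore from the helper names, which is optional. The `tree_sitter_djot_external_scanner_*` entry points have to stay non-static because tree-sitter looks them up by name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical scanner state; the real struct in the Djot scanner may differ. */
typedef struct {
    uint16_t delayed_token;
    bool has_delayed_token;
} Scanner;

/* `static` gives these helpers internal linkage, so their names cannot
 * collide with same-named functions in another grammar's scanner. */
static void init(Scanner *s) {
    assert(s != NULL);
    s->delayed_token = 0;
    s->has_delayed_token = false;
}

static void set_delayed_token(Scanner *s, uint16_t token) {
    assert(s != NULL);
    s->delayed_token = token;
    s->has_delayed_token = true;
}

/* Entry points that tree-sitter resolves by name must remain non-static. */
void *tree_sitter_djot_external_scanner_create(void) {
    Scanner *s = malloc(sizeof(Scanner));
    if (s != NULL) {
        init(s);
    }
    return s;
}

void tree_sitter_djot_external_scanner_destroy(void *payload) {
    free(payload);
}
```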

treeman · 2024-10-22T09:19:24Z

> @treeman Can you explain what happened here? It looks like the master branch now has the (WIP?) split parser, which broke nvim-treesitter, but this PR was not merged.

Oh crap... I must've accidentally pushed the split branch into master somehow. That's what I get for not making master protected. I've reverted master now.

Anyway, this branch should be stable and I've been using it for a few months without issues. Top-level package.json and similar files are missing (markdown has them) and I'm not sure how to test those properly...

But we should be able to make nvim-treesitter use the split parser approach (we need to update the grammars though). This should fix a bunch of bugs and hopefully be faster in some cases.

clason · 2024-10-22T09:41:14Z

> Oh crap... I must've accidentally pushed the split branch into master somehow. That's what I get for not making master protected. I've reverted master now.

Thank you!

> Anyway, this branch should be stable and I've been using it for a few months without issues. Top-level package.json and similar files are missing (markdown has them) and I'm not sure how to test those properly...

Just do tree-sitter generate (or tree-sitter init --update) with the latest CLI 0.24.3 and you'll get all those (and more)... I don't think you need to test them, but you could look at the workflows used in https://github.com/tree-sitter-grammars/template.

> But we should be able to make nvim-treesitter use the split parser approach (we need to update the grammars though). This should fix a bunch of bugs and hopefully be faster in some cases.

Sure! Someone needs to make a PR to update the parsers and -- more importantly -- the queries, though. (And that someone has to be you, I'm afraid ;))

treeman · 2025-01-25T11:40:28Z

The split parser is basically done.

However, after playing around with the bindings, I realized that using a split parser gives a fairly negative user experience:

- Users need to install two grammars instead of just one
- The parse tree is quite ugly, with lots of extra inline nodes
- It's annoying to have multiple grammars if you want to do something custom using the parser (such as tree-sitter supported transformations)

So I'll try to backport the fixes and features to the single grammar version. I don't know if I can do that in a good way but I want to try it before committing to the more awkward solution of multiple grammars.

Complete rewrite of the inline parser together with many changes to the block parsing. Should make the parser follow the spec much more closely, in particular the inline precedence rules. #55

treeman · 2025-01-28T09:33:28Z

Closed in favor of #56

treeman added 25 commits August 20, 2024 09:05

Move files into their own block level directory

e5bde4c

Remove inline parsing elements from grammar

0f04575

Also add some fixes:
- Arbitrary depth for headings
- Allow spaces around table separator
- Support link reference definitions next to each other with arbitrary inline content

WIP First iteration of djot inline parser

eeab9dc

Fix div inside blockquote

f38156a

Inline parser for verbatim

5dcfb87

WIP external scanner for emphasis attempt

6cc531a

There's real promise here!

WIP Almost all emphasis tests pass

394347b

Some cleanup

17b3d59

Emphasis prototype that passes the weird test cases I could find

5660f6c

Merge verbatim with open elements

4cc2dd1

Mark end at the beginning in scanner

ef94ad6

Some more comments

47d1b88

Move all span elements to external scanner

b7d9695

Update highlights

85cf546

Move lots of elements to external scanner

84e0fc7

Try to update package description a bit... Not sure about them really.

Move footnote markers to external scanner

d33cce9

Update comments and refactor a little

d6a62e2

Fix concealment and highlight of delimiters

9700795

Refactor bracket parsing

5ce1e6e

Ability to escape ) in link urls

b915d05

Fix dynamic precedence and refactor things a bit

aeb4754

\\ before newline shouldn't be a hard linebreak

631db8f

Fallback for [^foo

e4122c2

Remove locals, not sure what to do with a split parser

34928cf

Fix highlights

63dc041

feat: Stop elements inside inline urls and update to 0.24.6

fa74cc4

treeman added 7 commits January 9, 2025 11:03

refactor: Separate scanning and parsing end markers (except verbatim)

29ee257

feat: Correctly handle precedence for *[*](y)

e08cd93

Also some refactoring

feat: Upgrade djot to tree-sitter 0.24.6

c19e267

fix: only consider single alpha characters as list markers

49bcc15

docs: Update readme

3473e07

feat: Add fields to nodes and use inline for injection

0b241cc

feat: Smarter detection of inline link scanning

bb26458

Solves different precedence issues for spans and links

treeman force-pushed the split branch from 1b9f720 to bb26458 Compare January 10, 2025 09:21

fix: Remove unused verbatim field

e10ad91

zetashift mentioned this pull request Jan 17, 2025

Add Djot support helix-editor/helix#12562

Open

treeman added 12 commits January 22, 2025 21:28

fix: Prevent content from jumping out of lists

d7132b7

test: Two more broken list cases

225169f

fix: Better spacer detection inside lists

aeb3fc3

feat: Proper block parsing of footer content (like in lists)

5426b10

fix: Ending markers for language injection regex

0036a74

fix: Prefer inline attributes with text directly following

10b04b4

perf: Rewrite table to use external scanner to avoid excessive branching

260729d

fix: Escape pipes and verbatim support in table cells

ce0d52f

perf: Scan block attributes via external scanner for ~4x speedup

385e7e0

fix: Narrow verbatim in table cells

487a4bc

test: More failing block attribute formats

1b53d70

feat: Proper handling of newlines in comments

5e654be

treeman mentioned this pull request Jan 28, 2025

Backport split parser changes #56

Merged

treeman closed this Jan 28, 2025
