Custom reStructuredText parser #5

brechtm · 2021-01-05T20:04:33Z

brechtm
Jan 5, 2021
Maintainer

In response to #1 (comment) by @Carreau:

For what it's worth, for an unrelated (private) project I've started writing my own RST parser, even if it is incomplete (for now), and if I'm the only user I don't think I'm going to support everything (for example, I think tables should be a directive).

It's one of my first parsers, so it's ugly and has a few requirements that you may not want in a stricter RST parser, so I'm not ready to share it yet. I'm mostly trying to parse NumPy docstrings, which use a slight variant of RST in some sections, so I had to make a couple of weird design decisions.

I assume you are aware of napoleon which can parse NumPy docstrings?

I want a CST more than an AST as I'd like to be able to reformat existing RST.

I'd like to be able to parse and get a CST/AST without having to pre-register existing directives.

I'm not familiar with the differences between CSTs and ASTs (note to self: read Abstract vs. Concrete Syntax Trees), but I recently also ran into the need to parse reStructuredText as-is.

docutils performs some transformations during parsing on the syntax tree. Some of these transformations can be disabled, but others are hard-coded. This makes it impossible to reconstruct the original reStructuredText for some elements (e.g. the role and contents directives). So far, I have been able to work around this by monkey-patching docutils.

It would of course be much better if docutils first parsed the reStructuredText into a tree that represents the source exactly. This would be a good candidate for cooperation.

asmeurer · 2021-01-05T20:31:34Z

I'm not familiar with the differences between CSTs and ASTs

CSTs keep all the information about the original source, including whitespace. With a CST, it is possible to round-trip like source -> CST -> source and produce the exact same source code. ASTs typically only keep track of syntax and ignore whitespace when it is irrelevant. This simplifies things because you typically don't care about those things that don't actually affect the program, but it also limits what you can do.

An example would be the Python ast module, which provides an AST that isn't a CST. Using the new ast.unparse function in Python 3.9:

>>> import ast
>>> ast.unparse(ast.parse('x +  (1)'))
'x + 1'

Notice that the irrelevant information (the extra space after the + and the redundant parentheses) was lost. This is fine if all you are doing is analysing the source code, but if you want to do something like refactor the source code, you tend to want to leave everything you can intact.
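
To illustrate the round-trip property on reStructuredText-like input, here is a minimal sketch of a lossless tree. It is not a real RST parser: the Block class, its two node kinds and the parse/unparse functions are hypothetical names invented for this example. The only point is that every character of the source is stored somewhere in the tree, so unparsing gives back the exact input.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    kind: str         # "paragraph" or "blank" (hypothetical node kinds)
    lines: List[str]  # raw source lines, line endings included


def parse(source: str) -> List[Block]:
    """Group lines into blank and non-blank runs, keeping every character."""
    blocks: List[Block] = []
    for line in source.splitlines(keepends=True):
        kind = "blank" if not line.strip() else "paragraph"
        if blocks and blocks[-1].kind == kind:
            blocks[-1].lines.append(line)
        else:
            blocks.append(Block(kind, [line]))
    return blocks


def unparse(blocks: List[Block]) -> str:
    """Concatenate the stored lines; nothing was dropped, so this is exact."""
    return "".join(line for block in blocks for line in block.lines)


source = "Title\n=====\n\n\nSome   text with   odd spacing.  \n   indented\n"
assert unparse(parse(source)) == source
```

An AST-style tree would instead store something like a normalized title string and paragraph text, which is easier to analyse but cannot reproduce the trailing spaces or the doubled blank line.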

asmeurer · 2021-01-05T20:37:27Z

docutils performs some transformations during parsing on the syntax tree. Some of these transformations can be disabled, but others are hard-coded. This makes it impossible to reconstruct the original reStructuredText for some elements (e.g. the role and contents directives). So far, I have been able to work around this by monkey-patching docutils.

Can you show example code of what this looks like? I think I may have mentioned this in an earlier conversation, but for sphinx-math-dollar, we extend RST to allow using dollar signs for math (like $\sin(x)$). But there is a problem with absolute values, because docutils preprocesses |...| as substitutions before the place where we do our syntax transformation (sympy/sphinx-math-dollar#16). Is there a safe way to monkey-patch docutils for this extension only, so that we can avoid this issue without also disabling |substitutions| that occur outside of dollar signs? (Feel free to move this discussion to sympy/sphinx-math-dollar#16, as it is somewhat off topic here.)

1 reply

brechtm Jan 6, 2021
Maintainer Author

To leave the substitution references as-is, no monkey-patch was needed. However, I did need to disable the standard transforms (references.Substitutions is probably the relevant one). That can be done by passing a docutils.readers.Reader instance to any of the docutils.core.publish*() functions. Note that this will yield substitution_reference nodes for each |...| encountered; it will just not perform the substitution. For example:

from docutils.core import publish_file
from docutils.readers import Reader

publish_file(source_path=rst_file_path,
             reader=Reader(),    # no transforms
             writer=MyCustomWriter())

I can do this in my application since it is parsing the reStructuredText directly. In sphinx-math-dollar, Sphinx will be performing the parsing, so you don't have this level of control. Also, monkey-patching docutils will very likely interfere with regular Sphinx operation, so that will not be an option for you either.

My advice, unhelpful as it may be, is to stick to custom roles and directives when extending reStructuredText.

Carreau · 2021-01-05T21:03:04Z

I assume you are aware of napoleon which can parse NumPy docstrings?

Yes, I am aware, and I also contributed fixes to numpydoc. I'm actually currently using 1/2 numpydoc and 1/2 custom parsing.

Aaron pointed out what a CST is, and yes, mostly I want to be able to get back to the original source, for these reasons:

1) fix mistakes in syntax with heuristics when the parsing was wrong (unfortunately numpydoc also loses some information);
2) try to respect styling (3 vs. 4 spaces for indent, or things that had user-defined visual alignment);
3) because some sections have different syntax (as pointed out in 1).

Personally, I would also appreciate having an AST/CST that can be exported reliably to JSON, for potential processing in another language or for being picked up by a different process later on. That should not be a problem as long as we don't rely on shared instances later, which I can see happening with replacements and references.
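
As a rough sketch of what that could look like: if references point at their targets by an identifier string rather than by holding the target object, the whole tree is plain data and survives a JSON round trip. The Node class, its field names and the to_json/from_json helpers below are all hypothetical, made up for this illustration; they are not an existing docutils or numpydoc API.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class Node:
    kind: str                        # e.g. "paragraph" (hypothetical kinds)
    text: str = ""                   # raw source text owned by this node
    node_id: Optional[str] = None    # set on nodes that can be referenced
    target_id: Optional[str] = None  # set on nodes that reference another node
    children: List["Node"] = field(default_factory=list)


def to_json(node: Node) -> str:
    # asdict() recurses into the nested Node instances in `children`.
    return json.dumps(asdict(node), sort_keys=True)


def from_json(data: str) -> Node:
    def build(raw: dict) -> Node:
        return Node(
            kind=raw["kind"],
            text=raw["text"],
            node_id=raw["node_id"],
            target_id=raw["target_id"],
            children=[build(child) for child in raw["children"]],
        )
    return build(json.loads(data))


tree = Node("document", children=[
    Node("substitution_definition", text="1.0", node_id="sub-version"),
    Node("paragraph", children=[
        Node("text", text="Release "),
        # The reference stores an id string, not the definition object.
        Node("substitution_reference", text="|version|", target_id="sub-version"),
    ]),
])

assert from_json(to_json(tree)) == tree
```

Resolving target_id back to a node then becomes a lookup step that any consumer can do for itself, in whatever language reads the JSON.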

asmeurer · 2021-01-06T00:51:05Z

fix mistakes in syntax with heuristics when the parsing was wrong

In that case you'd need something even more sophisticated than a CST. You need something that can handle incomplete/incorrect syntax. Usually you can only handle this sort of thing by looking at the token stream, which is often too low level to work with effectively. Perhaps something similar to parso could be built that can process "error nodes" without breaking the rest of the parsing.
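
A minimal sketch of the "error node" idea, assuming a toy grammar that only knows about section titles and plain lines. The node kinds, the UNDERLINE_CHARS set and the rule that a short underline becomes an error node are all assumptions made for this example; this is not how docutils or parso actually behave. The malformed title is wrapped in an error node, parsing continues with the next line, and the source can still be reproduced exactly.

```python
from dataclasses import dataclass
from typing import List

UNDERLINE_CHARS = set("=-~^\"'`#*+")  # assumed subset of adornment characters


@dataclass
class Node:
    kind: str         # "title", "error" or "text" (hypothetical kinds)
    lines: List[str]  # raw lines, kept verbatim
    message: str = ""


def is_underline(line: str) -> bool:
    stripped = line.rstrip()
    return (bool(stripped) and len(set(stripped)) == 1
            and stripped[0] in UNDERLINE_CHARS)


def parse(source: str) -> List[Node]:
    lines = source.splitlines(keepends=True)
    nodes: List[Node] = []
    i = 0
    while i < len(lines):
        line = lines[i]
        following = lines[i + 1] if i + 1 < len(lines) else ""
        if line.strip() and not is_underline(line) and is_underline(following):
            if len(following.rstrip()) >= len(line.rstrip()):
                nodes.append(Node("title", [line, following]))
            else:
                # Keep the bad input in the tree instead of aborting.
                nodes.append(Node("error", [line, following],
                                  "title underline too short"))
            i += 2
        else:
            nodes.append(Node("text", [line]))
            i += 1
    return nodes


def unparse(nodes: List[Node]) -> str:
    return "".join(line for node in nodes for line in node.lines)


source = "Good\n====\n\nBroken title\n---\n\nbody\n"
nodes = parse(source)
assert [n.kind for n in nodes] == ["title", "text", "error", "text", "text"]
assert unparse(nodes) == source
```

Because the error node still owns its raw lines, a later pass can inspect just that node, apply a heuristic fix, and leave the rest of the tree untouched.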

Carreau · 2021-01-06T02:08:59Z

Well, RST tends to be relatively forgiving, and so far I've found that most of the trees I get are correct, except that the wrong nodes are detected due to issues with spacing or a missing underscore in the right places. Worst case, if I have a CST I can unparse just a section and run heuristics on it specifically.
