Adjust raw text extraction from docutils documents #33

rtobar · 2021-11-30T03:05:22Z

The previous version of this code relied on the Text.rawsource attribute
to obtain the raw, original version of the translated texts contained in
.po files. This attribute however was removed in docutils 0.18, and thus
a different way of obtaining this information was needed.

(Note that this attribute removal was planned, but not for this release
yet: it's currently listed not in 0.18's list of changes, but under
"Future changes". https://sourceforge.net/p/docutils/bugs/437/ has been
opened to get this eventually clarified)

The commit that removed the Text.rawsource mentioned that the data fed
into the Text elements was already the raw source, hence there was no
need to keep a separate attribute. Text objects derive from str, so we
can directly add them to the list of strings where NodeToTextVisitor
builds the original text, with the caveat that it needs to have
backslashes restored (they are encoded as null bytes after parsing,
apparently).

The other side-effect of using the Text objects directly instead of the
Text.rawsoource attribute is that now we get more of them. The document
resulting from docutils' parsing can contain system_message elements
with debugging information from the parsing process, such as warnings.
These are Text elements with no rawsource, but with actual text, so we
need to skip them. In the same spirit, citation_references and
substitution_references need to be ignored as well.

All these changes allow pospell to work against the latest docutils. On
the other hand, the lowest supported version is 0.16: 0.11 through 0.14
failed at rfc role parsing (used for example in the python docs), and
0.15 didn't have a method to restore backslashes (which again made the
python docs fail).

This addresses #30.

Signed-off-by: Rodrigo Tobar rtobar@icrar.org

The previous version of this code relied on the Text.rawsource attribute to obtain the raw, original version of the translated texts contained in .po files. This attribute however was removed in docutils 0.18, and thus a different way of obtaining this information was needed. (Note that this attribute removal was planned, but not for this release yet: it's currently listed not in 0.18's list of changes, but under "Future changes". https://sourceforge.net/p/docutils/bugs/437/ has been opened to get this eventually clarified) The commit that removed the Text.rawsource mentioned that the data fed into the Text elements was already the raw source, hence there was no need to keep a separate attribute. Text objects derive from str, so we can directly add them to the list of strings where NodeToTextVisitor builds the original text, with the caveat that it needs to have backslashes restored (they are encoded as null bytes after parsing, apparently). The other side-effect of using the Text objects directly instead of the Text.rawsoource attribute is that now we get more of them. The document resulting from docutils' parsing can contain system_message elements with debugging information from the parsing process, such as warnings. These are Text elements with no rawsource, but with actual text, so we need to skip them. In the same spirit, citation_references and substitution_references need to be ignored as well. All these changes allow pospell to work against the latest docutils. On the other hand, the lowest supported version is 0.16: 0.11 through 0.14 failed at rfc role parsing (used for example in the python docs), and 0.15 didn't have a method to restore backslashes (which again made the python docs fail). Signed-off-by: Rodrigo Tobar <rtobar@icrar.org>

JulienPalard · 2021-11-30T16:56:59Z

I have no issue with docutils>=0.16 so it's all good \o/

Thanks!

JulienPalard merged commit c4feb4d into AFPy:main Nov 30, 2021

rtobar deleted the docutils-0.18 branch July 10, 2022 01:29

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Adjust raw text extraction from docutils documents #33

Adjust raw text extraction from docutils documents #33

rtobar commented Nov 30, 2021

JulienPalard commented Nov 30, 2021

Adjust raw text extraction from docutils documents #33

Adjust raw text extraction from docutils documents #33

Conversation

rtobar commented Nov 30, 2021

JulienPalard commented Nov 30, 2021