be more selective about escaping special characters #122

chrispy-snps · 2024-04-14T15:20:31Z

This is a refinement of #118 (thanks @jsm28!).

The current solution escapes every instance of every special character. Although conservative, this can lead to unnecessary escaping. For example:

>>> from markdownify import markdownify as md

>>> md('Pick a color (just 1) to use.')
'Pick a color (just 1\\) to use.'

>>> md('Pick a color, just 1. Write it down.')
'Pick a color, just 1\\. Write it down.'

>>> md('1 + 1 = 2')
'1 \\+ 1 \\= 2'

>>> md('start ----> end')
'start \\-\\-\\-\\-\\> end'

In our use case, our input content is technical documentation (many special characters) and the content is subsequently edited by humans, so it is desirable to minimize unnecessary escaping.

This pull request seeks to strike a balance between the following:

Maximizing required escaping
Minimizing unnecessary escaping
Minimizing code complexity
Minimizing runtime penalty (about 25% overall)

The tests cover a variety of required and unnecessary escaping cases, which should help catch future regressions in escaping behavior.

This approach is not foolproof. Markdownify processes each text fragment in isolation, and thus the beginning of a particular string might not be the beginning of an output line. As a result, patterns are not applied across text fragment boundaries (such as adjacent  elements). Handling this probably requires a larger rework of the text processing code.

I also noticed that the original code had def text_misc() instead of def test_misc(), which caused the tests never to run.

chrispy-snps · 2024-04-14T15:26:21Z

@jsm28, @AlexVonB - what are your thoughts? I understand this is not as comprehensive as the "escape-everything" approach, but I'm hoping it strikes a balance between catching realistic scenarios and avoiding unnecessary escaping. If issues are filed for false negatives or false positives, the approach can be refined (and the test set updated accordingly).

jsm28 · 2024-04-14T17:54:51Z

Being at the start of a line isn't a very safe test at this stage either: wrapping takes place later and can move text to the start of a line, where it gains a significance it didn't have mid-line. When wrapping, anything after a space should therefore effectively be treated as being at the start of a line, and this affects most of the examples given in this issue. I was considering making the escaping smarter anyway, because there are lots of examples in my use case (converting 547 issues filed against past versions of the C standard and related documents for a new issue tracker) where it can safely be determined that the escaping is unnecessary in any position, such as 1.2.3.4 or X-Y. (My preprocessing removes all  - or converts it into other markup after processing a subset of the CSS found in some of the input - before the input gets passed to markdownify and I don't think fragment boundaries can be an issue in my case.)

(Yes, wrapping is significantly buggy at present, in particular losing   in the input; that's on my list of issues to fix.)
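To illustrate the wrapping concern, here is a minimal sketch that uses the standard-library `textwrap` module as a stand-in for the later wrapping step (an assumption for illustration; `wrap_lines` is a hypothetical helper, not part of markdownify). It takes one of the examples from the pull request description and shows that an unescaped `1.` sitting mid-line can land at the start of a line once the text is wrapped.

```python
import textwrap


def wrap_lines(text, width):
    """Hypothetical helper: wrap text and return the resulting lines."""
    return textwrap.fill(text, width=width).splitlines()


lines = wrap_lines('Pick a color, just 1. Write it down.', 20)
for line in lines:
    print(repr(line))
# 'Pick a color, just'
# '1. Write it down.'   <- now looks like an ordered-list item
```

Because the break can fall after any space, a check that only looks at the start of the unwrapped string would miss this case.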

chrispy-snps · 2024-04-14T23:28:48Z

@jsm28 - aww dangit, I forgot about wrapping. I wonder if any of the work in this pull request is salvageable?

What are your thoughts on how to make escaping smarter? Perhaps we could escape everything, then apply a set of regular expressions after wrapping that removes unnecessary escapes (although this doesn't give me the warmest fuzzy feeling).

Possibly related, I've wanted to figure out a fix for adjacent <p> elements not getting a space between them in table cells:

>>> from markdownify import markdownify as md
>>> html_doc = """
... <table>
...   <tr>
...     <td>
...       <p>abc</p><p>def</p>
...     </td>
...   </tr>
... </table>
... """
>>> md(html_doc)
'\n\n\n| abcdef |\n| --- |\n\n\n'

jsm28 · 2024-04-15T03:06:22Z

Fixing test function naming (plus any consequent fixes needed for tests to pass) is definitely salvageable!

I was thinking mainly about . and - as the main cases I saw where escaping obviously wasn't needed (my input data has a lot of subclause references in the form 6.1.2.3). For ., for example, restricting escaping to the case where . is preceded by digits that are preceded by either whitespace or start of string, and is followed by either whitespace or end of string, should suffice. (For the  case, add the case where . is at the start of the string and so we don't know if there might be digits preceding it.) For -, a sequence of one or more consecutive - should only need escaping if preceded by whitespace or start of string, and followed by whitespace or end of string (and this should work even for the  case).
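The two rules described above could be sketched with regular expressions roughly as follows. This is only an illustration under stated assumptions: `escape_dots_and_dashes` is a hypothetical name, the patterns are not taken from markdownify's code, and the extra fragment-start case for `.` is left out for brevity.

```python
import re


def escape_dots_and_dashes(text):
    """Hypothetical sketch of the selective rules; not markdownify's code."""
    # Escape "." only when it could form an ordered-list marker: digits
    # preceded by whitespace or start of string, with the "." followed by
    # whitespace or end of string.
    text = re.sub(r'((?:^|\s)[0-9]+)\.(?=\s|$)', r'\1\\.', text)
    # Escape a run of "-" only when it stands alone between whitespace
    # (or the string boundaries); one backslash before the run is enough.
    text = re.sub(r'(^|\s)(-+)(?=\s|$)', r'\1\\\2', text)
    return text


print(escape_dots_and_dashes('see 6.1.2.3 and X-Y'))   # see 6.1.2.3 and X-Y
print(escape_dots_and_dashes('just 1. Write it down.'))  # just 1\. Write it down.
print(escape_dots_and_dashes('start - end'))             # start \- end
```

Treating "preceded by whitespace" the same as "start of string" is what keeps the rule safe when a later wrapping step moves text to the start of a line.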

This is a partial alternative to matthewwithanm#122 (open since April) for more selective escaping of some special characters. Here, we fix the test function naming (as noted in that PR) so the tests are actually run (and fix some incorrect test assertions so they pass). We also make escaping of `-#.)` (the most common cases of unnecessary escaping in my use case) more selective, while still being conservatively safe in escaping all cases of those characters that might have Markdown significance (including in the presence of wrapping, unlike in matthewwithanm#122). (Being conservatively safe doesn't include the cases where `.` or `)` start a fragment, where the existing code already was not conservatively safe.) There are certainly more cases where the code could also be made more selective while remaining conservatively safe (including in the presence of wrapping), so this is not a complete replacement for matthewwithanm#122, but by fixing some of the most common cases in a safe way, and getting the tests actually running, I hope this allows progress to be made where the previous attempt appears to have stalled, while still allowing further incremental progress with appropriately safe logic for other characters where useful.

AlexVonB · 2024-11-24T11:20:29Z

Hey! This was probably closed by #149. Thanks for your time and patience!

chrispy-snps force-pushed the chrispy/update-escaping branch from 8b8f32c to fefc50d Compare April 14, 2024 15:22

be more selective about escaping special characters

0fb49c0

chrispy-snps force-pushed the chrispy/update-escaping branch from fefc50d to 0fb49c0 Compare April 14, 2024 15:30

i-ky mentioned this pull request Aug 19, 2024

Pin markdownify version OSMLatvija/zulip-rss#1

Closed

alfonsrv mentioned this pull request Sep 2, 2024

Escape all characters with Markdown significance #118

Merged

This was referenced Oct 2, 2024

More selective escaping of -#.) (alternative approach) #149

Merged

Escape right square bracket #148

Open

AlexVonB closed this Nov 24, 2024
