Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

be more selective about escaping special characters #122

Conversation

chrispy-snps
Copy link
Collaborator

@chrispy-snps chrispy-snps commented Apr 14, 2024

This is a refinement of #118 (thanks @jsm28!).

The current solution escapes every instance of every special character. Although conservative, this can lead to unnecessary escaping. For example,

>>> from markdownify import markdownify as md

>>> md('Pick a color (just 1) to use.')
'Pick a color (just 1\\) to use.'

>>> md('Pick a color, just 1. Write it down.')
'Pick a color, just 1\\. Write it down.'

>>> md('1 + 1 = 2')
'1 \\+ 1 \\= 2'

>>> md('start ----> end')
'start \\-\\-\\-\\-\\> end'

In our use case, our input content is technical documentation (many special characters) and the content is subsequently edited by humans, so it is desirable to minimize unnecessary escaping.

This pull request seeks to strike a balance between the following:

  • Maximizing required escaping
  • Minimizing unnecessary escaping
  • Minimizing code complexity
  • Minimizing runtime penalty (about 25% overall)

The tests cover a variety of required and unnecessary escaping cases, which can hopefully avoid any future regressions in escaping behavior.

This approach is not foolproof. Markdownify processes each text fragment in isolation, and thus the beginning of a particular string might not be the beginning of an output line. As a result, patterns are not applied across text fragment boundaries (such as adjacent <span> elements). Handling this probably requires a larger rework of the text processing code.

I also noticed that the original code had def text_misc() instead of def test_misc(), which caused the tests never to run.

@chrispy-snps chrispy-snps force-pushed the chrispy/update-escaping branch from 8b8f32c to fefc50d Compare April 14, 2024 15:22
@chrispy-snps
Copy link
Collaborator Author

@jsm28, @AlexVonB - what are your thoughts? I understand this is not as comprehensive as the "escape-everything" approach, but I'm hoping it strikes a balance between catching realistic scenarios while avoiding unnecessary scenarios. If issues are filed for false negatives or false positives, the approach can be refined (and the test set updated accordingly).

@chrispy-snps chrispy-snps force-pushed the chrispy/update-escaping branch from fefc50d to 0fb49c0 Compare April 14, 2024 15:30
@jsm28
Copy link
Contributor

jsm28 commented Apr 14, 2024

Being at start of line isn't a very safe test at this stage either, because wrapping takes place later and can move text to start of line that only has significance there, so anything after a space should effectively be considered to be at start of line when wrapping, and this affects most of the examples given in this issue. I was considering making the escaping smarter anyway because there are lots of examples in my use case (converting 547 issues filed against past versions of the C standard and related documents for a new issue tracker) where it can safely be determined that the escaping is unnecessary in any position, such as 1.2.3.4 or X-Y. (My preprocessing removes all <span> - or converts it into other markup after processing a subset of the CSS found in some of the input - before the input gets passed to markdownify and I don't think fragment boundaries can be an issue in my case.)

(Yes, wrapping is significantly buggy at present, in particular losing <br> in the input; that's on my list of issues to fix.)

@chrispy-snps
Copy link
Collaborator Author

@jsm28 - aww dangit, I forgot about wrapping. I wonder if any of the work in this pull request is salvageable?

What are your thoughts on how to make escaping smarter? Perhaps we could escape everything, then apply a set of regular expressions after wrapping that removes unnecessary escapes (although this doesn't give me the warmest fuzzy feeling).

Possibly related, I've wanted to figure out a fix for adjacent <p> elements not getting a space between them in table cells:

>>> from markdownify import markdownify as md
>>> html_doc = """
... <table>
...   <tr>
...     <td>
...       <p>abc</p><p>def</p>
...     </td>
...   </tr>
... </table>
... """
>>> md(html_doc)
'\n\n\n| abcdef |\n| --- |\n\n\n'

@jsm28
Copy link
Contributor

jsm28 commented Apr 15, 2024

FIxing test function naming (plus any consequent fixes needed for tests to pass) is definitely salvageable!

I was thinking mainly about . and - as the main cases I saw where escaping obviously wasn't needed (my input data has a lot of subclause references in the form 6.1.2.3). For ., for example, restricting escaping to the case where . is preceded by digits that are preceded by either whitespace or start of string, and is followed by either whitespace or end of string, should suffice. (For the <span> case, add the case where . is at the start of the string and so we don't know if there might be digits preceding it.) For -, a sequence of one or more consecutive - should only need escaping if preceded by whitespace or start of string, and followed by whitespace or end of string (and this should work even for the <span> case).

jsm28 added a commit to jsm28/python-markdownify that referenced this pull request Oct 2, 2024
This is a partial alternative to matthewwithanm#122 (open since April) for more
selective escaping of some special characters.

Here, we fix the test function naming (as noted in that PR) so the
tests are actually run (and fix some incorrect test assertions so they
pass).  We also make escaping of `-#.)` (the most common cases of
unnecessary escaping in my use case) more selective, while still being
conservatively safe in escaping all cases of those characters that
might have Markdown significance (including in the presence of
wrapping, unlike in matthewwithanm#122).  (Being conservatively safe doesn't include
the cases where `.` or `)` start a fragment, where the existing code
already was not conservatively safe.)

There are certainly more cases where the code could also be made more
selective while remaining conservatively safe (including in the
presence of wrapping), so this is not a complete replacement for matthewwithanm#122,
but by fixing some of the most common cases in a safe way, and getting
the tests actually running, I hope this allows progress to be made
where the previous attempt appears to have stalled, while still
allowing further incremental progress with appropriately safe logic
for other characters where useful.
@AlexVonB
Copy link
Collaborator

Hey! This was probably closed by #149 . Thanks for your time and patience!

@AlexVonB AlexVonB closed this Nov 24, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants