Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

be more selective about escaping special characters #122

Open
wants to merge 1 commit into
base: develop
Choose a base branch
from

Conversation

chrispy-snps
Copy link
Collaborator

@chrispy-snps chrispy-snps commented Apr 14, 2024

This is a refinement of #118 (thanks @jsm28!).

The current solution escapes every instance of every special character. Although conservative, this can lead to unnecessary escaping. For example,

>>> from markdownify import markdownify as md

>>> md('Pick a color (just 1) to use.')
'Pick a color (just 1\\) to use.'

>>> md('Pick a color, just 1. Write it down.')
'Pick a color, just 1\\. Write it down.'

>>> md('1 + 1 = 2')
'1 \\+ 1 \\= 2'

>>> md('start ----> end')
'start \\-\\-\\-\\-\\> end'

In our use case, our input content is technical documentation (many special characters) and the content is subsequently edited by humans, so it is desirable to minimize unnecessary escaping.

This pull request seeks to strike a balance between the following:

  • Maximizing required escaping
  • Minimizing unnecessary escaping
  • Minimizing code complexity
  • Minimizing runtime penalty (about 25% overall)

The tests cover a variety of required and unnecessary escaping cases, which can hopefully avoid any future regressions in escaping behavior.

This approach is not foolproof. Markdownify processes each text fragment in isolation, and thus the beginning of a particular string might not be the beginning of an output line. As a result, patterns are not applied across text fragment boundaries (such as adjacent <span> elements). Handling this probably requires a larger rework of the text processing code.

I also noticed that the original code had def text_misc() instead of def test_misc(), which caused the tests never to run.

@chrispy-snps
Copy link
Collaborator Author

@jsm28, @AlexVonB - what are your thoughts? I understand this is not as comprehensive as the "escape-everything" approach, but I'm hoping it strikes a balance between catching realistic scenarios while avoiding unnecessary scenarios. If issues are filed for false negatives or false positives, the approach can be refined (and the test set updated accordingly).

@jsm28
Copy link
Contributor

jsm28 commented Apr 14, 2024

Being at start of line isn't a very safe test at this stage either, because wrapping takes place later and can move text to start of line that only has significance there, so anything after a space should effectively be considered to be at start of line when wrapping, and this affects most of the examples given in this issue. I was considering making the escaping smarter anyway because there are lots of examples in my use case (converting 547 issues filed against past versions of the C standard and related documents for a new issue tracker) where it can safely be determined that the escaping is unnecessary in any position, such as 1.2.3.4 or X-Y. (My preprocessing removes all <span> - or converts it into other markup after processing a subset of the CSS found in some of the input - before the input gets passed to markdownify and I don't think fragment boundaries can be an issue in my case.)

(Yes, wrapping is significantly buggy at present, in particular losing <br> in the input; that's on my list of issues to fix.)

@chrispy-snps
Copy link
Collaborator Author

@jsm28 - aww dangit, I forgot about wrapping. I wonder if any of the work in this pull request is salvageable?

What are your thoughts on how to make escaping smarter? Perhaps we could escape everything, then apply a set of regular expressions after wrapping that removes unnecessary escapes (although this doesn't give me the warmest fuzzy feeling).

Possibly related, I've wanted to figure out a fix for adjacent <p> elements not getting a space between them in table cells:

>>> from markdownify import markdownify as md
>>> html_doc = """
... <table>
...   <tr>
...     <td>
...       <p>abc</p><p>def</p>
...     </td>
...   </tr>
... </table>
... """
>>> md(html_doc)
'\n\n\n| abcdef |\n| --- |\n\n\n'

@jsm28
Copy link
Contributor

jsm28 commented Apr 15, 2024

FIxing test function naming (plus any consequent fixes needed for tests to pass) is definitely salvageable!

I was thinking mainly about . and - as the main cases I saw where escaping obviously wasn't needed (my input data has a lot of subclause references in the form 6.1.2.3). For ., for example, restricting escaping to the case where . is preceded by digits that are preceded by either whitespace or start of string, and is followed by either whitespace or end of string, should suffice. (For the <span> case, add the case where . is at the start of the string and so we don't know if there might be digits preceding it.) For -, a sequence of one or more consecutive - should only need escaping if preceded by whitespace or start of string, and followed by whitespace or end of string (and this should work even for the <span> case).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants