be more selective about escaping special characters #122
@jsm28, @AlexVonB - what are your thoughts? I understand this is not as comprehensive as the "escape-everything" approach, but I'm hoping it strikes a balance between catching realistic scenarios and avoiding unnecessary escaping. If issues are filed for false negatives or false positives, the approach can be refined (and the test set updated accordingly).
Being at start of line isn't a very safe test at this stage either, because wrapping takes place later and can move text to start of line that only has significance there. Anything after a space should therefore effectively be considered to be at start of line when wrapping, and this affects most of the examples given in this issue. I was considering making the escaping smarter anyway, because there are lots of examples in my use case (converting 547 issues filed against past versions of the C standard and related documents for a new issue tracker) where it can safely be determined that the escaping is unnecessary in any position, such as (Yes, wrapping is significantly buggy at present, in particular losing
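The wrapping concern above can be reproduced with a small sketch. It uses Python's standard `textwrap` module as a stand-in; whether markdownify wraps in exactly this way is an assumption, and the helper name `wrap_lines` is made up for illustration. The point is only that a hyphen that sat mid-line before wrapping can land at the start of a line afterwards, where Markdown would read it as a list marker.

```python
import textwrap


def wrap_lines(text, width):
    """Wrap text at whitespace only (hypothetical helper, not markdownify code)."""
    # break_on_hyphens=False keeps the rule simple: lines break only at spaces.
    return textwrap.wrap(text, width=width, break_on_hyphens=False)


# The hyphen is in the middle of the input line...
lines = wrap_lines("alpha beta - gamma", 10)
for line in lines:
    print(line)
# ...but after wrapping the second line is "- gamma",
# which a Markdown parser would treat as a bullet list item.
```

So a start-of-line test applied before wrapping would leave this hyphen unescaped, which is why anything following a space has to be treated as potentially line-initial.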
@jsm28 - aww dangit, I forgot about wrapping. I wonder if any of the work in this pull request is salvageable? What are your thoughts on how to make escaping smarter? Perhaps we could escape everything, then apply a set of regular expressions after wrapping that removes unnecessary escapes (although this doesn't give me the warmest fuzzy feeling). Possibly related, I've wanted to figure out a fix for adjacent
Fixing test function naming (plus any consequent fixes needed for tests to pass) is definitely salvageable! I was thinking mainly about
This is a partial alternative to matthewwithanm#122 (open since April) for more selective escaping of some special characters. Here, we fix the test function naming (as noted in that PR) so the tests are actually run (and fix some incorrect test assertions so they pass). We also make escaping of `-#.)` (the most common cases of unnecessary escaping in my use case) more selective, while still being conservatively safe in escaping all cases of those characters that might have Markdown significance (including in the presence of wrapping, unlike in matthewwithanm#122). (Being conservatively safe doesn't include the cases where `.` or `)` start a fragment, where the existing code already was not conservatively safe.) There are certainly more cases where the code could also be made more selective while remaining conservatively safe (including in the presence of wrapping), so this is not a complete replacement for matthewwithanm#122, but by fixing some of the most common cases in a safe way, and getting the tests actually running, I hope this allows progress to be made where the previous attempt appears to have stalled, while still allowing further incremental progress with appropriately safe logic for other characters where useful.
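As a rough illustration of what "more selective but still wrapping-safe" could look like for `#`, here is a minimal sketch. It is not the code from either pull request; the function name `escape_hashes` and the exact rule are assumptions. The rule assumed here follows the ATX heading syntax: a run of one to six `#` characters only matters when it could begin a line and is followed by whitespace or the end of the text. Because wrapping may move anything after a space to the start of a line, "preceded by whitespace" is treated the same as "at start of line".

```python
import re

# A run of 1-6 '#' that (a) starts the text or follows whitespace, and
# (b) is followed by whitespace or the end of the text, could become an
# ATX heading once wrapping has moved it to the start of a line.
_HASH_RUN = re.compile(r"(^|\s)(#{1,6})(?=\s|$)")


def escape_hashes(text):
    """Escape only '#' runs that might be read as headings (illustrative sketch)."""
    return _HASH_RUN.sub(r"\1\\\2", text)


print(escape_hashes("C# is fine"))   # left alone: '#' follows a letter
print(escape_hashes("issue #122"))   # left alone: '#' is followed by a digit
print(escape_hashes("# Title"))      # escaped: could be a heading
```

A real implementation would need equivalent care for `-`, `.` and `)`, and for fragments that start or end with these characters, which this sketch does not attempt.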
Hey! This was probably closed by #149. Thanks for your time and patience!
This is a refinement of #118 (thanks @jsm28!).
The current solution escapes every instance of every special character. Although conservative, this can lead to unnecessary escaping. For example,
In our use case, our input content is technical documentation (many special characters) and the content is subsequently edited by humans, so it is desirable to minimize unnecessary escaping.
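To make the cost of the conservative approach concrete, here is a small sketch of an "escape every special character" function. The function name `escape_everything` and the character set are assumptions chosen for illustration; they are not taken from markdownify's source.

```python
import re

# Assumed set of Markdown-significant characters, for illustration only.
_SPECIAL = re.compile(r"([\\`*_{}\[\]()#+\-.!|])")


def escape_everything(text):
    """Backslash-escape every occurrence of every assumed special character."""
    return _SPECIAL.sub(r"\\\1", text)


print(escape_everything("v1.2 (beta) - C#"))
# prints: v1\.2 \(beta\) \- C\#
```

None of the characters in that input would change how the text renders if left alone on a single line, yet every one of them is escaped, which is the noise a human editor then has to read around.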
This pull request seeks to strike a balance between the following:
The tests cover a variety of required and unnecessary escaping cases, which can hopefully avoid any future regressions in escaping behavior.
This approach is not foolproof. Markdownify processes each text fragment in isolation, and thus the beginning of a particular string might not be the beginning of an output line. As a result, patterns are not applied across text fragment boundaries (such as adjacent `<span>` elements). Handling this probably requires a larger rework of the text processing code.

I also noticed that the original code had `def text_misc()` instead of `def test_misc()`, which caused the tests never to run.
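A misnamed test like this fails silently because, by default, pytest only collects functions whose names start with `test`. The sketch below is one way to spot such functions; the helper name `uncollected_functions` is hypothetical, and it assumes the project has not changed pytest's `python_functions` setting.

```python
import types


def uncollected_functions(namespace):
    """Return names of plain functions that pytest's default rule would skip.

    Assumes the default collection rule (function name starts with "test").
    """
    return sorted(
        name
        for name, obj in namespace.items()
        if isinstance(obj, types.FunctionType) and not name.startswith("test")
    )


# Stand-ins for the contents of a test module.
def text_misc():  # typo: never collected under the default rule
    pass


def test_misc():  # collected
    pass


skipped = uncollected_functions({"text_misc": text_misc, "test_misc": test_misc})
print(skipped)
```

Running a check like this over a test module (for example via `vars(module)`) would also flag legitimate helpers, so the output is a list to review by eye, not a list of errors.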