Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rewrote the escape function to escape all markdown characters #242

Merged
merged 4 commits into from
Jul 18, 2018
Merged

rewrote the escape function to escape all markdown characters #242

merged 4 commits into from
Jul 18, 2018

Conversation

yareeba
Copy link
Contributor

@yareeba yareeba commented Jun 28, 2018

in order to fix issue #233

Copy link
Collaborator

@domchristie domchristie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your contribution! 💫 I love the use of reduce!

I have made quite a few comments, so I hope this doesn't come across as ungrateful…

Overall, I like this approach. As I mentioned before, I think the library should escape aggressively to ensure that valid markdown is outputted. However, I don't think it is necessary to escape all markdown characters in any context (sorry, I should've mention this before). Escaping *, _, and ` in all contexts makes sense, but for the other characters, I think we can be a little less aggressive so that the output isn't overloaded with \.

As I said, I hope the comments aren't too overwhelming. Let me know what you think, and thanks again!

src/turndown.js Outdated
@@ -6,6 +6,24 @@ import Node from './node'
var reduce = Array.prototype.reduce
var leadingNewLinesRegExp = /^\n*/
var trailingNewLinesRegExp = /\n*$/
const markdownReplacements = [
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps var escapes = [

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

src/turndown.js Outdated
[/-/g, '\\-'],
[/\+/g, '\\+'],
[/=/g, '\\='],
[/#/g, '\\#'],
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

# is only used for atx-style headings, so we can be more specific and reduce the number of escapes. Something like:

[/^(\W* {0,3})(#{1,6} )/gm, '$1\\$2']

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Simplified this to

[/^(#{1,6})/g, '\\$1']

produces an escaped headline like this:

### headline = \### headline

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

src/turndown.js Outdated
[/=/g, '\\='],
[/#/g, '\\#'],
[/`/g, '\\`'],
[/~/g, '\\~'],
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

~ can only be used for code blocks, so to reduce the number of escapes, something like:

[/^(\W* {0,3})~~~/gm, '$1\\~~~']

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It can also be used for a strikethrough effect:

~~strikethrough~~ = strikethrough

In this case shall we stick with the aggressive escaping for ~?

src/turndown.js Outdated
[/#/g, '\\#'],
[/`/g, '\\`'],
[/~/g, '\\~'],
[/\|/g, '\\|'],
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As pipes (|) are used for tables, this should probably be part of turndown-plugin-gfm, and so can be removed

src/turndown.js Outdated
[/~/g, '\\~'],
[/\|/g, '\\|'],
[/\(/g, '\\('],
[/\)/g, '\\)'],
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think ( or ) need to be escaped if [ and ] are escaped, so we can remove these two lines

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

test/index.html Outdated
<pre class="expected">You can use * for multiplication: 1.5 * 3 = 4.5</pre>
<div class="case" data-name="escaping =">
<div class="input">A sentence containing =</div>
<pre class="expected">A sentence containing \=</pre>
</div>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test case can probably be removed

test/index.html Outdated
<div class="input">42 > 1</div>
<pre class="expected">42 > 1</pre>
<pre class="expected">42 \> 1</pre>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test case can be reverted

test/index.html Outdated
<div class="case" data-name="escaping parentheses">
<div class="input">(A sentence containing)</div>
<pre class="expected">\(A sentence containing\)</pre>
</div>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test case can be removed

test/index.html Outdated
<div class="case" data-name="escaping |">
<div class="input">A sentence containing |</div>
<pre class="expected">A sentence containing \|</pre>
</div>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test case can be removed

test/index.html Outdated
<div class="case" data-name="escaping ~">
<div class="input">A sentence containing ~</div>
<pre class="expected">A sentence containing \~</pre>
</div>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should altered to only test a case where a ~ could be converted to markdown

@yareeba
Copy link
Contributor Author

yareeba commented Jul 4, 2018

Hi @domchristie, thanks for the comments. Hopefully I will get a chance to look at this PR at some point this week and make changes. 👍

@Paul-Ladyman
Copy link
Contributor

Hi @domchristie. I work with @ayusaf1992 and I've been looking through the PR. Your feedback seems to be mostly around not escaping certain characters so aggressively where they are only valid markdown in very specific circumstances and that makes sense to me. I've left a few replies here and there for further clarification but broadly speaking I think we'll try to find some time to make these changes this week :-).

Thanks for all your help!

src/turndown.js Outdated
[/\\/g, '\\\\'],
[/\*/g, '\\*'],
[/-/g, '\\-'],
[/\+/g, '\\+'],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@Paul-Ladyman
Copy link
Contributor

Hi @domchristie. In my recent commit I've made most of the changes discussed above. Outstanding issues are:

  • dealing with pipes in turndown-plugin-gfm
  • dealing with - given its used for bullets and hrs
  • how to escape ~ given they can be used for strikethrough.
  • dealing with _ (we've currently implemented it as &lowbar; for reasons described above)

We're keen to get this work merged as we have other priorities demanding our time. For that reason I'm thinking that the approach here to deal with |, - and ~ could stay as it is and be improved as part of a separate bit of work that we could perhaps leave to yourself. What do you think?

That just leaves _ which I'm a little less sure of how to handle. Any thoughts given the comment above?

Cheers.

@domchristie
Copy link
Collaborator

Hi @Paul-Ladyman, thanks so much for your work on this. Sorry it has taken a couple of days to get back to you (I have been busy working as well as enjoying the rather nice weather!) Please see my comments below:

dealing with pipes in turndown-plugin-gfm
how to escape ~ given they can be used for strikethrough.

By default, I don't think that Turndown should escape non-commonmark characters, so it shouldn't escape pipes. Tildes are an interesting case due to the way they are used in other markdown flavours. I think the ultimate solution to this would be to add an "escape" API (disableEscape('~'), addEscape('~' , …)), but this is definitely one for another time. In the meantime I think we should go with the less aggressive escape for tildes. escape will be documented as public API, and as a recommended place to customise escaping (#244 (comment)). Sorry this isn't ideal for your use-case.

dealing with - given its used for bullets and hrs
Do you think as a first pass it would be reasonable to escape - aggressively and work on more sophisticated escaping as part of a future task?

Yes, I think this is fine for now. I can work on implementing function replacements later.

dealing with _ (we've currently implemented it as &lowbar; for reasons described above)

I'm not keen on &lowbar; or HTML entities, because underscores are reasonably common in writing, and Turndown is often used to obtain a clean plain text version of an HTML document, so I'd likely get a lot of issues raised as a result :/ With that in mind, I think we should stick with plain underscore escaping (sorry!). I'd be interested to know why plain underscore escaping does not work for you, and if it's a common issue then we can reconsider implementing &lowbar;.

As I mentioned, I think we can improve the escaping feature, but in the meantime, you may want to override the escape method with something like:

var oldEscape = TurndownService.prototype.escape

TurndownService.prototype.escape = function (string) {
  var escapes = [
    [/[^\\]~/g, '\\~'], // only escape ~ that have not already been escaped
    [/\|/g, '\\|']
  ]
  string = string.replace(/_/g, '&lowbar;')
  string = oldEscape(string)
  
  return escapes.reduce(function (accumulator, escape) {
    return accumulator.replace(escape[0], escape[1])
  }, string)
}

Paul added 2 commits July 18, 2018 09:17
- Don't escape | here as its not commonmark
- Use regular escaping for _ rather than HTML entities
@Paul-Ladyman
Copy link
Contributor

Hi @domchristie,

Thanks again for your feedback!

Thinking more about pipes the gfm table syntax is so specific that I don't think we (my colleagues and I) should be immediately worried about our users recreating it. And I think its a good idea to keep the gfm stuff to the plugin so I've removed it from this PR. Updating the plugin may not be something we do ourselves in the short term however.

I didn't realise that the use of tilde for strikethrough was part of gfm. I've updated the regex to [/~~~/g, '\\~~~'] since otherwise it was having trouble picking up the opening and closing block tags. This should still result in less escaping.

I've also changed the _ escaping to normal backslashes. We weren't really expecting this to be merged but thought it was worth the conversation. I like your suggestion of how to add extra escaping on top of Turndown's so I think we'll do something like this whilst we try to get to the bottom of the issue.

Think that about covers it :-).

Cheers!

@domchristie
Copy link
Collaborator

Hi @Paul-Ladyman

Thanks again for your feedback!

Thank you! I really appreciate the time you and your colleagues (@ayusaf1992 & @olih) have put into this. Out of interest, are you using it for anything interesting? Let me know if you are able to do so :)

Thinking more about pipes the gfm table syntax is so specific that I don't think we (my colleagues and I) should be immediately worried about our users recreating it. And I think its a good idea to keep the gfm stuff to the plugin so I've removed it from this PR. Updating the plugin may not be something we do ourselves in the short term however.

👍

I didn't realise that the use of tilde for strikethrough was part of gfm. I've updated the regex to [/~~~/g, '\~~~'] since otherwise it was having trouble picking up the opening and closing block tags. This should still result in less escaping.

👍

I've also changed the _ escaping to normal backslashes. We weren't really expecting this to be merged but thought it was worth the conversation. I like your suggestion of how to add extra escaping on top of Turndown's so I think we'll do something like this whilst we try to get to the bottom of the issue.

👍

The changes will be in the next version which will be released soon. Thanks again!

@domchristie domchristie merged commit aad75e1 into mixmark-io:master Jul 18, 2018
@domchristie
Copy link
Collaborator

Released in v5.0.0

@Paul-Ladyman
Copy link
Contributor

Hi @domchristie,

Nice to see this got released.

Sorry I didn't answer your question before. @ayusaf1992, @olih and I work for the BBC. We're implementing a new internal article editor that journalists working in for example BBC Sport or BBC News will use to write and publish articles to their respective sites. The editor uses SlateJS and we use Markdown to represent Slate's rich text "marks" in a more common format.

We're actually thinking of moving away from Markdown as we have some requirements that don't seem to be a standard part of any spec. For example representing the difference between text written in a left-to-right language and text in a right-to-left language.

In any case we were happy to help out with this bug fix and pleased it ended up getting released :-).

Cheers,
Paul

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants