character "<, >, &" will be translated into &lt; &gt; &amp; #779

Closed
poberwong opened this issue Aug 1, 2016 · 14 comments
Labels
category: code blocks, L2 - annoying (similar to L1 - broken, but a known workaround is available)

Comments

@poberwong

poberwong commented Aug 1, 2016

Can anyone tell me how to resolve this? I just want the characters to show up as-is in a code block, not as their entity codes.

@Feder1co5oave
Contributor

Are you using sanitize: true as an option?
Provide some sample input and expected output, please.

@adam-lynch

Possible duplicate of #529

@hojas

hojas commented Oct 24, 2016

The sanitize option does not control this.

@bholt

bholt commented Nov 3, 2016

The problem I'm seeing is that the sanitization is happening incorrectly for code blocks specifically: outside a code block, "&" works fine and shows up as "&", but inside a code block "&" renders as "&amp;". Obviously I don't want to disable sanitize; I just want code blocks to de-sanitize.

This seems related to #287 which was never addressed.

@Feder1co5oave
Contributor

Ampersand is a special character in HTML, so it must be escaped as the entity reference &amp;.
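
To make the point above concrete, here is a minimal sketch of HTML escaping. The helper name `escapeHtml` and the entity table are illustrative assumptions, not marked's internal implementation; the only claim is that `&`, `<`, and `>` have to become entity references before they are placed inside HTML text.

```javascript
// Hypothetical helper for illustration only (not marked's own escape function).
const HTML_ENTITIES = {
  '&': '&amp;',
  '<': '&lt;',
  '>': '&gt;',
  '"': '&quot;',
  "'": '&#39;'
};

// Replace every HTML-special character with its entity reference.
function escapeHtml(text) {
  return String(text).replace(/[&<>"']/g, (ch) => HTML_ENTITIES[ch]);
}

console.log(escapeHtml('a => b && c < d'));
// prints: a =&gt; b &amp;&amp; c &lt; d
```

A browser shows the escaped string as the original characters again, so seeing a literal `&amp;` or `&gt;` on the rendered page usually means the text was escaped more than once.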

@joshbruce
Member

& < >

@joshbruce
Member

Given that small test - and it being scoped to code blocks, we might actually have something here. Believe HTML looks at the contents of code blocks differently.

joshbruce added the "L2 - annoying" and "category: code blocks" labels on Jan 25, 2018

@barthel

barthel commented Jan 26, 2018

The GFM spec from GitHub covers < in fenced code blocks and includes two examples: https://github.github.com/gfm/#fenced-code-blocks

Outside of code blocks, < should be escaped as \< and is then transformed into &lt;: https://github.github.com/gfm/#backslash-escapes

@joshbruce
Member

@barthel: Thanks for the reference. That's interesting because the GFM examples show < and > being converted to entity references (&lt; and &gt;). But GitHub itself doesn't seem to escape them.

Weird. Am I missing something??

@wwkimball

Just dropping a note of support for this issue. I was gearing one of my sites up to use marked.js (with highlight.js) until I ran into this. For education materials, I need to share blocks of Ruby and Puppet code where symbols like => are very often used. With this bug, my sample code blocks render very poorly, making them impossible to read.

The issue appears to be possible double-handling of &. For example, when => is encountered within a code-block, it is converted to =&gt;, which might be fine on its own. However, the resulting value appears to then be re-processed again into =&amp;gt;. That second handling is the problem.

I did find a viable workaround, however. In order to incorporate highlight.js, I evidently had to write some custom renderer code for marked.js, so I took guidance from Shuhei Kagawa. Because I was already customizing the HTML output to add CSS style hints for highlight.js, I took the same opportunity to fix the broken &amp;s. Note that the following code is significantly different from Shuhei's because I'm still required to support Internet Explorer users (yes, it's painful and yes, it's reality) along with all the latest browsers.

My workaround to this problem (and the problem of making marked.js and highlight.js play nicely together with CSS):

// With guidance from https://shuheikagawa.com/blog/2015/09/21/using-highlight-js-with-marked/
const marked       = window.marked;
const highlightjs  = window.hljs;
const hljsRenderer = new marked.Renderer();

hljsRenderer.code = function(block, lang) {
    // Colorize the block only if the language is known to highlight.js
    var realLang = ((null == lang) ? 'plaintext' : lang);
    var colorized = !!(realLang && highlightjs.getLanguage(realLang))
        ? highlightjs.highlight(realLang, block).value
        : block
    ;
    return '<pre rel="' + realLang + '">' + "\n"
        + '<code class="hljs ' + realLang + '">'
        + colorized.replace(/&amp;/g, '&')
        + '</code>' + "\n"
        + '</pre>'
    ;
};

// Set the renderer to marked
marked.setOptions({
    renderer: hljsRenderer
});

// Monkey in String.trimStart() support for browsers that don't support it
String.prototype.trimStart = String.prototype.trimStart || function() {
    return this.replace(/^\s+/, '');
};

// Render Markdown-formatted publications as HTML
document.getElementById('publication_body').innerHTML =
    marked(
        document.getElementById('publication_body').innerHTML.trimStart()
    );

Bonus: The code above enabled me to display the code language above every code-block via CSS, like:

pre[rel]::before {
    text-transform: capitalize;
    font-size: 0.75em;
    content: attr(rel);
    color: white;
}
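
One detail of the custom renderer in this comment is worth a sketch: when the language is unknown, the block is emitted as-is. Assuming the renderer receives the raw, unescaped block text, that fallback would put unescaped `<` and `&` into the page. The variant below escapes in the fallback path and drops the `&amp;` replacement. The names `renderCodeBlock`, `escapeHtml`, and the injected `highlighter` object (with `getLanguage` and `highlight` shaped like the calls used in this comment) are assumptions made so the example runs without marked or highlight.js.

```javascript
// Hypothetical helper: escape HTML-special characters.
function escapeHtml(text) {
  const map = { '&': '&amp;', '<': '&lt;', '>': '&gt;', '"': '&quot;', "'": '&#39;' };
  return String(text).replace(/[&<>"']/g, (ch) => map[ch]);
}

// `highlighter` is an injected stand-in for highlight.js (assumption):
//   getLanguage(name) -> truthy if the language is known
//   highlight(name, code) -> { value: '<already escaped, colorized HTML>' }
function renderCodeBlock(code, lang, highlighter) {
  const realLang = lang || 'plaintext';
  const known = Boolean(highlighter && highlighter.getLanguage(realLang));
  // Highlighted output is expected to be escaped already; the raw fallback is not.
  const body = known ? highlighter.highlight(realLang, code).value : escapeHtml(code);
  const safeLang = escapeHtml(realLang);
  return '<pre rel="' + safeLang + '">\n'
    + '<code class="hljs ' + safeLang + '">' + body + '</code>\n'
    + '</pre>';
}

// Stub highlighter that only "knows" ruby, for demonstration.
const stubHighlighter = {
  getLanguage: (name) => name === 'ruby',
  highlight: (name, code) => ({ value: '<span class="hl">' + escapeHtml(code) + '</span>' })
};

console.log(renderCodeBlock('a => b', 'nolang', stubHighlighter));
console.log(renderCodeBlock('a => b', 'ruby', stubHighlighter));
```

With a real setup, the function body could be assigned to the renderer's `code` method and the global highlight.js object passed in place of the stub.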

@styfle
Member

styfle commented Oct 15, 2018

> The issue appears to be possible double-handling of &. For example, when => is encountered within a code-block, it is converted to =&gt;, which might be fine on its own. However, the resulting value appears to then be re-processed again into =&amp;gt;. That second handling is the problem.

I'm glad you found a workaround!

However, I am not seeing this issue with the default settings in marked.

Example markdown

Perhaps you are using the sanitize: true option as others have mentioned.

I would suggest using a different sanitizer than the built-in one as discussed in #1232
...in particular, this comment

@wwkimball

We should discard my reply to this thread as a false alarm. PHP is unexpectedly sanitizing my output before any JavaScript ever gets to see it. With my sole focus on converting Markdown documents into HTML, my eyes were seeing formatting instead of content this whole time. Satisfied with the change in appearance of the Markdown content (to HTML) by incorporating marked.js with highlight.js, I sat back and read a test document. At that point, I saw the undesired, overly-sanitized output. By then, only JavaScript was salient in my thoughts and my thinking was mistakenly boxed into that frame. I blamed marked.js for a PHP issue and for that, I am sorry.

For Posterity

In the MySQL database, the Markdown content is correct; loads of => assignment operators within myriad code-blocks. When I dump that content to log from within PHP, it is still unchanged; loads of => assignment operators within myriad code-blocks. However, when the content gets written to the HTML output stream (via DOMDocument) within the PHP code, all those => are being mysteriously changed into =&gt;. Something within PHP is sanitizing the stream before any JavaScript ever receives it. I had no idea this was happening.
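
The double escaping described here can be reproduced in isolation. The sketch below is an illustration under assumptions: `escapeCodeText` stands in for any step that escapes code-block text (the server-side output step or a Markdown renderer), and `decodeBasicEntities` is a hypothetical helper that undoes one round of escaping before the text reaches the renderer. Neither is part of marked or PHP.

```javascript
// Stand-in for any step that escapes text for HTML (assumption).
function escapeCodeText(text) {
  return text
    .replace(/&/g, '&amp;')   // must run first so later entities are not re-escaped
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

// Hypothetical helper: undo exactly one round of escaping.
function decodeBasicEntities(text) {
  return text
    .replace(/&lt;/g, '<')
    .replace(/&gt;/g, '>')
    .replace(/&amp;/g, '&');  // must run last so "&amp;lt;" becomes "&lt;", not "<"
}

const fromDatabase = 'h = { :a => 1 }';

// First escape: what the page receives from the server.
const afterServer = escapeCodeText(fromDatabase);
console.log(afterServer);   // prints: h = { :a =&gt; 1 }

// Second escape: what a renderer produces from already-escaped input.
const afterRenderer = escapeCodeText(afterServer);
console.log(afterRenderer); // prints: h = { :a =&amp;gt; 1 }

// Decoding once before rendering restores the original text.
console.log(decodeBasicEntities(afterServer) === fromDatabase); // prints: true
```

The browser decodes only one level of entities, so the second result displays as a literal `=&gt;`, which matches the symptom reported in this thread.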

Separately (and this is entirely moot), I was not setting any options for marked.js other than the renderer. The code you see in my earlier reply is 100% of the JavaScript on that page other than the imports in the head, which are simply:

<script src="https://cdn.jsdelivr.net/npm/marked/marked.min.js"></script>
<script src="/src/js/highlight/highlight.pack.js"></script>
<script src="/src/js/blog-render-md.js" defer></script>

The entire content of blog-render-md.js is visible in the cited reply, constituting all JavaScript on the page.

@styfle
Member

styfle commented Oct 15, 2018

@wwkimball Thanks for the details, that makes sense 👍

Since this issue doesn't have any steps to reproduce, I'm going to close it.

I'm going to reiterate for future readers, see #1232 for better sanitize options.

@mominrazashahid

I am also facing this issue when using express sanitize. Is there any solution?
