Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Certain characters in inline code incorrectly parsed (e.g., &) #372

Closed
nschloe opened this issue Nov 21, 2023 · 10 comments · Fixed by #393
Closed

Certain characters in inline code incorrectly parsed (e.g., &) #372

nschloe opened this issue Nov 21, 2023 · 10 comments · Fixed by #393

Comments

@nschloe
Copy link

nschloe commented Nov 21, 2023

MWE:

import mistune
from mistune.core import BlockState
markdown = mistune.create_markdown(renderer="ast")

md = r"`&<>`"
tokens = markdown(md)

print(tokens)

Output:

[{'type': 'paragraph', 'children': [{'type': 'codespan', 'raw': '&amp;&lt;&gt;'}]}]
@nschloe nschloe changed the title ` \& ` not preserved in parse-render loop \& not preserved in parse-render loop Nov 21, 2023
@nschloe nschloe changed the title \& not preserved in parse-render loop Certain characters in inline code incorrectly parsed (e.g., &) Nov 21, 2023
@torokati44
Copy link

torokati44 commented May 29, 2024

We also encountered this. The cause is 8452faf, more specifically this change, I think.
Putting HTML escaping into the parser stage, independently of the output format, is incorrect.

@veenstrajelmer
Copy link

veenstrajelmer commented Jul 5, 2024

I also encountered this, originally found in omnilib/sphinx-mdinclude#19. I did some digging and found that the issue lays in mistune.util.escape_url indeed as @torokati44 points towards, but only because its behavior is different depending on whether html is installed (this happens in the import, at least of mistune 2.0.5):

mistune/mistune/util.py

Lines 22 to 31 in cb580e8

def escape_url(link):
safe = (
':/?#@' # gen-delims - '[]' (rfc3986)
'!$&()*+,;=' # sub-delims - "'" (rfc3986)
'%' # leave already-encoded octets alone
)
if html is None:
return quote(link.encode('utf-8'), safe=safe)
return html.escape(quote(html.unescape(link), safe=safe))

Some code to reproduce:

from urllib.parse import quote
import html

link = "https://sonarcloud.io/api/project_badges/measure?project=Deltares_ddlpy&metric=alert_status"

# code from: def escape_url(link):
safe = (
    ':/?#@'           # gen-delims - '[]' (rfc3986)
    '!$&()*+,;='      # sub-delims - "'" (rfc3986)
    '%'               # leave already-encoded octets alone
)
out_nonhtml = quote(link.encode('utf-8'), safe=safe)
out_withhtml = html.escape(quote(html.unescape(link), safe=safe))
out_withhtml_noescape = quote(html.unescape(link), safe=safe)

print(out_nonhtml)
print(out_withhtml)
print(out_withhtml_noescape)

This gives different results. The first one is returned if html is not installed, the second one if html is installed. The third one is correct again:

https://sonarcloud.io/api/project_badges/measure?project=Deltares_ddlpy&metric=alert_status
https://sonarcloud.io/api/project_badges/measure?project=Deltares_ddlpy&amp;metric=alert_status
https://sonarcloud.io/api/project_badges/measure?project=Deltares_ddlpy&metric=alert_status

Therefore, what I think should fix the issue is to remove the html.escape() from the escape_url function so the behaviour is always consistent and according to expectations.

Update for newer mistune versions
The escape_url function looks different in the master branch (>3.0.0):

def escape_url(link: str) -> str:
"""Escape URL for safety."""
safe = (
':/?#@' # gen-delims - '[]' (rfc3986)
'!$&()*+,;=' # sub-delims - "'" (rfc3986)
'%' # leave already-encoded octets alone
)
return escape(quote(unescape(link), safe=safe))

In this case omitting escape does the trick.

@veenstrajelmer
Copy link

Also today, in the master branch, & still gets converted to &amp;. I am not sure whether escape should just be removed from escape_url, since that does not seem to make sense given the function name. However, this was the behavior before if html was not installed with return quote(link.encode('utf-8'), safe=safe), so it might also just be the fix. Either way, it would be great if & markdown urls is not converted.

@itcarroll
Copy link

I'm going down the rabbit hole from jupyter-book > sphinx > myst > ??? here, maybe? I'm getting &amp; in markdown links that include query parameters. Any workaround?

@veenstrajelmer
Copy link

I think what you are encountering is because of the bug described in this issue indeed. No workaround as far as I know, @lepture could you ahare your thoughts on this discussion?

@mentalisttraceur
Copy link
Contributor

@itcarroll actually, Jupyter's MyST uses markdown-it-py, not Mistune.

@mentalisttraceur
Copy link
Contributor

mentalisttraceur commented Oct 31, 2024

@veenstrajelmer I think you've spotted a separate issue. Similar, but independent code paths. The premature HTML-escaping of inline code just uses escape, totally bypassing escape_url.

@veenstrajelmer
Copy link

@mentalisttraceur ok fair, but will this issue be resolved? I see limited response from maintainers of this repository, so I am not sure what to expect. If it is a different issue, does it help if I create a different issue, or will it not matter too much? I am a bit hesitant in putting even more investigation time in this package, because of te limited response. My issue is just a single readme badge that does not work in the docs of all my packages (e.g. https://deltares.github.io/dfm_tools). I can relatively easily convert all the readme's from markdown to rst, but only raised this issue since I prefer markdown.

@stdedos
Copy link

stdedos commented Nov 1, 2024

@veenstrajelmer "not having this discussion here", would matter at least for us that are "actively" participating on this issue 🤷
You can also clearly see that someone decided to step up and fix this issue (after this issue being stale for a while).

I would be optimistic - as long as issues don't get "more tangled".

@veenstrajelmer
Copy link

I was not intending to tangle issues, just noticed this issue that seems pretty much as what I am running into. I do not see how they are caused and therefore also not see how they are unrelated. To me they both come down to incorrect character escaping in mistune, but apparently they have different causes. Either way, I have created another issue here: #394.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants