Ignoring unsupported HTML tags in Markdown input inserts a space in LaTeX and RST output #5050

Closed
mgeier opened this issue Nov 6, 2018 · 3 comments


mgeier commented Nov 6, 2018

It is expected that unsupported HTML tags get stripped out, but this should not insert a space at the beginning of the next line.

$ pandoc --from markdown --to latex
<abc>
x
^D
 x
$ pandoc --from markdown --to rst
<abc>
x
^D
 x

Note that in LaTeX that may not be a problem, but in reST, the indentation creates an unwanted "blockquote".

Interestingly, this does not happen with HTML input:

$ pandoc --from html --to rst
<abc>
x
^D
x
$ pandoc --version
pandoc 2.2.1
Compiled with pandoc-types 1.17.5.1, texmath 0.11.1, skylighting 0.7.3
...
Collaborator

mb21 commented Nov 6, 2018

I guess the solution is to bring the markdown reader in line with the commonmark spec?

~ pandoc -f commonmark -t native
<abc>
x
^D
[RawBlock (Format "html") "<abc>\nx\n"]

~ pandoc -f markdown -t native
<abc>
x
^D
[Para [RawInline (Format "html") "<abc>",SoftBreak,Str "x"]]

Not sure how the SoftBreak affects the output, though...

Owner

jgm commented Nov 6, 2018

Correct, the difference between markdown and commonmark here is all about whether <abc> is parsed as block or inline HTML. Pandoc's markdown reader parses it as inline; you get the space because there's a space (to be exact, a SoftBreak) between the two inline elements: the (non-rendered) raw HTML and the string 'x'.
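
The mechanism described above can be modelled in a few lines. This is a toy sketch, not pandoc's writer code: the tuple encoding of the inline list and the `render_naive` function are assumptions made up for illustration, mirroring the native output `[RawInline (Format "html") "<abc>",SoftBreak,Str "x"]` from the earlier comment.

```python
# Toy model of the inline list from the native output; not pandoc's actual data types.
inlines = [("RawInline", "html", "<abc>"), ("SoftBreak",), ("Str", "x")]


def render_naive(inlines, target="rst"):
    """Render inlines, skipping raw content written for a different format."""
    out = []
    for node in inlines:
        kind = node[0]
        if kind == "RawInline":
            fmt, text = node[1], node[2]
            if fmt == target:
                out.append(text)  # raw content is only emitted for its own format
        elif kind == "SoftBreak":
            out.append(" ")  # the soft break survives even if its neighbour vanished
        elif kind == "Str":
            out.append(node[1])
    return "".join(out)


print(repr(render_naive(inlines)))          # ' x'
print(repr(render_naive(inlines, "html")))  # '<abc> x'
```

For the `rst` target the raw HTML contributes nothing, so the space from the SoftBreak ends up as the first character of the line.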

HTML input behaves differently because the raw_html extension is disabled by default for the HTML reader. If you turn it on, you get exactly the same result: pandoc --from html+raw_html --to rst.

It should still be possible to change pandoc so that, even with <abc> parsed as an inline tag, you don't get the extra space. (This is especially important in RST, since the space will change the semantics.) Perhaps Text.Pandoc.Pretty should be modified so that a BreakingSpace at the beginning or end of a Doc is not rendered. This would make sense to me, and it would fix the issue. You could still use text " " <> foo to get a space at the beginning if you wanted one, but space <> foo would not produce that.
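
The proposal above, sketched as a toy model. This is not the Text.Pandoc.Pretty implementation or its API: the `Text` and `BreakingSpace` classes, the `SPACE` constant, and the `render` function are hypothetical names chosen for illustration. It only shows the intended distinction between a breaking space (dropped at the edges of a document) and a literal `" "` (always kept).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Text:
    """Literal text; always rendered exactly as given."""
    value: str


class BreakingSpace:
    """A space that is only meaningful between two other pieces of content."""


SPACE = BreakingSpace()


def render(parts):
    """Concatenate parts, ignoring breaking spaces at the start or end."""
    items = list(parts)
    while items and isinstance(items[0], BreakingSpace):
        items.pop(0)
    while items and isinstance(items[-1], BreakingSpace):
        items.pop()
    return "".join(" " if isinstance(p, BreakingSpace) else p.value for p in items)


print(repr(render([SPACE, Text("x")])))             # 'x'
print(repr(render([Text(" "), Text("x")])))         # ' x'
print(repr(render([Text("a"), SPACE, Text("b")])))  # 'a b'
```

In this sketch, a breaking space in the middle of a document still renders, while one left dangling by a non-rendered neighbour disappears.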

jgm closed this as completed in 6619b96 on Nov 7, 2018
jgm added a commit that referenced this issue Nov 7, 2018
Author

mgeier commented Nov 7, 2018

@jgm Thanks for the quick fix!
