Man writer fails to handle links with non-ASCII characters. #8508

Closed
van-de-bugger opened this issue Dec 28, 2022 · 3 comments
@van-de-bugger

Example:

$ cat test.md
SEE ALSO
========

* [Milk](https://en.wikipedia.org/wiki/Milk)
* [EBNF](https://en.wikipedia.org/wiki/Extended_Backus–Naur_form)

Note the second link, EBNF: it contains a non-ASCII character, an en dash, between "Backus" and "Naur".

$ pandoc -t html5 < test.md > test.html

$ cat test.html 
<h1 id="see-also">SEE ALSO</h1>
<ul>
<li><a href="https://en.wikipedia.org/wiki/Milk">Milk</a></li>
<li><a href="https://en.wikipedia.org/wiki/Extended_Backus–Naur_form">EBNF</a></li>
</ul>

The HTML writer uses both link targets as-is. That is fine, since most browsers (at least Firefox and Chromium) handle them correctly.

However, let's see how they are handled by the man writer:

$ pandoc -s -t man < test.md > test.man

$ man -P cat ./test.man 
()                                                       ()

SEE ALSO
       • Milk (https://en.wikipedia.org/wiki/Milk)

       • EBNF

The first link is written correctly, but the second link just disappeared.

Pandoc version:

$ pandoc --version
pandoc 2.14.0.3
Compiled with pandoc-types 1.22.1, texmath 0.12.3.3, skylighting 0.10.5.2,
citeproc 0.4.0.1, ipynb 0.1.0.1
User data directory: /home/vdb/.local/share/pandoc
Copyright (C) 2006-2021 John MacFarlane. Web:  https://pandoc.org
This is free software; see the source for copying conditions. There is no
warranty, not even for merchantability or fitness for a particular purpose.
@jgm
Owner

jgm commented Dec 29, 2022

Pandoc's man writer will simply skip relative links, because they are generally just clutter in a man page.
The way it determines whether something is a relative link is by using Text.Pandoc.URI.isURI, which uses an official URI parser. I believe your en-dash is not allowed in a URI; it should be URI-escaped, and that is why isURI fails on it.
So, what is happening here is that pandoc is thinking this is a relative link, because it isn't a valid absolute URI.
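
The effect described above can be reproduced outside pandoc. The following is a minimal Python sketch, not pandoc's Haskell code: `looks_like_absolute_uri` is a hypothetical stand-in for a strict check such as `isURI`, and it assumes that "strict" means accepting only the ASCII characters RFC 3986 permits (it does not verify that each `%` is followed by two hex digits).

```python
import re

# Characters RFC 3986 permits in a URI: unreserved, reserved (gen-delims and
# sub-delims), and "%" for percent-escapes. All of them are ASCII.
_URI_CHARS = re.compile(r"[A-Za-z0-9\-._~:/?#\[\]@!$&'()*+,;=%]*")
# A scheme is a letter followed by letters, digits, "+", "." or "-", then ":".
_SCHEME = re.compile(r"[A-Za-z][A-Za-z0-9+.\-]*:")


def looks_like_absolute_uri(target: str) -> bool:
    """Hypothetical strict check (an assumption, not pandoc's isURI)."""
    return bool(_SCHEME.match(target)) and bool(_URI_CHARS.fullmatch(target))


milk = "https://en.wikipedia.org/wiki/Milk"
# "\u2013" is the en dash from the second link in test.md.
ebnf = "https://en.wikipedia.org/wiki/Extended_Backus\u2013Naur_form"

print(looks_like_absolute_uri(milk))  # True
print(looks_like_absolute_uri(ebnf))  # False: the en dash is outside the allowed set
```

A writer that treats every target failing such a check as a relative link would drop the EBNF target, which matches the man output shown in the report.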

@van-de-bugger
Author

> I believe your en-dash is not allowed in a URI…

You are right, it is not. However, the philosophy of Markdown is:

> Markdown is intended to be as easy-to-read and easy-to-write as is feasible

https://daringfireball.net/projects/markdown/syntax#philosophy

Escaping special characters in a URI makes it hard to read and hard to write. Just compare:

*   [EBNF](https://en.wikipedia.org/wiki/Extended_Backus–Naur_form)
*   [EBNF](https://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_form)

or:

* [ЕБНФ](https://ru.wikipedia.org/wiki/Расширенная_форма_Бэкуса_—_Наура)
* [ЕБНФ](https://ru.wikipedia.org/wiki/%D0%A0%D0%B0%D1%81%D1%88%D0%B8%D1%80%D0%B5%D0%BD%D0%BD%D0%B0%D1%8F_%D1%84%D0%BE%D1%80%D0%BC%D0%B0_%D0%91%D1%8D%D0%BA%D1%83%D1%81%D0%B0_%E2%80%94_%D0%9D%D0%B0%D1%83%D1%80%D0%B0)

I believe a Markdown processor (pandoc) should allow most characters to be unescaped, and take care of escaping and unescaping them behind the scenes, if required. For example, the Mozilla Firefox browser allows entering an invalid URI (https://ru.wikipedia.org/wiki/Расширенная_форма_Бэкуса_—_Наура) in the address bar, and converts it to a valid URI (https://ru.wikipedia.org/wiki/%D0%A0%D0%B0%D1%81%D1%88%D0%B8%D1%80%D0%B5%D0%BD%D0%BD%D0%B0%D1%8F_%D1%84%D0%BE%D1%80%D0%BC%D0%B0_%D0%91%D1%8D%D0%BA%D1%83%D1%81%D0%B0_%E2%80%94_%D0%9D%D0%B0%D1%83%D1%80%D0%B0) when it sends the request to the server. The Chromium browser does the same. Epiphany (aka "Web", the GNOME browser based on WebKitGTK) does the same.
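
The conversion the browsers perform can be sketched in Python with the standard library's `urllib.parse`. This is an illustration only, not what any browser or pandoc actually runs; the choice of `safe=":/"` is an assumption that suits this simple URL and would also escape characters such as `?` and `#` in a URL with a query or fragment.

```python
from urllib.parse import quote, unquote

# "\u2013" is the en dash from the EBNF link.
readable = "https://en.wikipedia.org/wiki/Extended_Backus\u2013Naur_form"

# Keep ":" and "/" literal so the scheme and path separators survive;
# non-ASCII characters are encoded as UTF-8 and then percent-escaped.
escaped = quote(readable, safe=":/")
print(escaped)
# https://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_form

# Unescaping recovers the readable form.
print(unquote(escaped) == readable)  # True
```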

For the sake of writability and readability, pandoc should allow all characters to be unescaped, with very few exceptions:

  • Closing bracket ), since it terminates a URI in the [link](URI) construct,
  • Greater-than >, since it terminates a URI in the <URI> construct,
  • Space, since it terminates a URI in the [link](URI "title") construct.

I have probably forgotten something else, such as the newline character.

Perhaps pandoc should not worry about escaping at all, since the Firefox, Chromium, and Epiphany browsers handle such links properly:

<a href="https://ru.wikipedia.org/wiki/Расширенная_форма_Бэкуса_—_Наура">РБНФ</a>

and I guess all other modern browsers do that, too.

@jgm
Owner

jgm commented Jan 5, 2023

Yes, I agree that we should handle these unescaped characters in URIs -- I was just explaining why pandoc currently behaves the way it does. It shouldn't be too complicated to fix this -- maybe just URI escape non-ASCII characters before calling isURI.
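
The idea of escaping only the non-ASCII characters before validating can be sketched as follows. This is a Python illustration under stated assumptions, not the change that was made in pandoc: `escape_non_ascii` is a hypothetical helper, and it deliberately leaves every ASCII character alone so that escapes already present in the target are not escaped a second time.

```python
import re
from urllib.parse import quote

# One or more consecutive characters outside the ASCII range.
_NON_ASCII = re.compile(r"[^\x00-\x7f]+")


def escape_non_ascii(target: str) -> str:
    """Percent-escape non-ASCII runs only; ASCII, including existing %XX, is untouched."""
    return _NON_ASCII.sub(lambda m: quote(m.group(0)), target)


# "\u2013" is the en dash from the EBNF link.
raw = "https://en.wikipedia.org/wiki/Extended_Backus\u2013Naur_form"
pre_escaped = "https://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_form"

print(escape_non_ascii(raw) == pre_escaped)          # True
print(escape_non_ascii(pre_escaped) == pre_escaped)  # True: no double escaping
print(escape_non_ascii(raw).isascii())               # True
```

Because the result is pure ASCII, a strict URI check applied afterwards no longer fails on the en dash, while the author can keep the readable form in the Markdown source.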

jgm closed this as completed in 0d891af on Jan 6, 2023
liruqi pushed a commit to chinapedia/pandoc that referenced this issue Mar 3, 2023