Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Markdown reader: class="uri" added for non-autolink hyperlinks #4913

Closed
cjerdonek opened this issue Sep 17, 2018 · 13 comments
Closed

Markdown reader: class="uri" added for non-autolink hyperlinks #4913

cjerdonek opened this issue Sep 17, 2018 · 13 comments

Comments

@cjerdonek
Copy link

cjerdonek commented Sep 17, 2018

With the most recent Pandoc (version 2.3), when converting a hyperlink from Pandoc Markdown to HTML, Pandoc can add class="uri" even when the Markdown link isn't an autolink (or isn't even allowed to be). For example ("try it" link):

[https://pandoc.org](https://pandoc.org)

[pandoc](pandoc)

becomes

<p><a href="https://pandoc.org" class="uri">https://pandoc.org</a></p>
<p><a href="pandoc" class="uri">pandoc</a></p>

You can see the effect this has on round-tripping below. With the above example--

<p><a href="https://pandoc.org" class="uri">https://pandoc.org</a></p>
<p><a href="pandoc" class="uri">pandoc</a></p>

you get--

<https://pandoc.org>

[pandoc]

  [pandoc]: pandoc {.uri}

See the related issue #1501 (where the feature was first added) and #4615 (re: email addresses).

@jgm
Copy link
Owner

jgm commented Sep 17, 2018 via email

@cjerdonek
Copy link
Author

In the Markdown writer case, one thing you'd need to decide is whether autolinking should be done (1) only if class="uri" is present, or (2) opportunistically (i.e. whenever possible). Currently, it seems to be doing it opportunistically. Another possibility would be to make this behavior configurable.

@jgm
Copy link
Owner

jgm commented Sep 18, 2018 via email

@cjerdonek
Copy link
Author

It seems to me like the HTML reader should preserve whether class="uri" was present (and similarly for other readers). Checking whether the link text = url doesn't add any information, so it wouldn't hurt to postpone it until writing. The case of writing markdown idiomatically (and perhaps other formats?) could be enabled by an extension (e.g. autolink_links) that checks whether the link text = url. That would let the user choose between round-tripping and always autolinking when possible.

This proposal would also handle cases like [https://pandoc.org](https://pandoc.org) in my example above for things like markdown -> AST -> markdown. It would enable both round-tripping and preferring idiomatic markdown.

@jgm
Copy link
Owner

jgm commented Sep 18, 2018 via email

@cjerdonek
Copy link
Author

cjerdonek commented Sep 18, 2018

So if we required it, we'd rarely get markdown autolinks in html -> md conversion.
...
I think it's probably acceptable if [https://pandoc.org](https://pandoc.org) turns into an autolink.

This can all be addressed by doing the check / auto-conversion when writing the Markdown. I was just advocating that the HTML (and Markdown) readers preserve things faithfully (and providing an option via extension to disable the auto-conversion when writing the Markdown). Doing the auto-conversion when writing the Markdown would also result in idiomatic Markdown in a wider range of scenarios. It would also confine the code to one place (the Markdown writer), instead of having to do the check in every reader.

@jgm
Copy link
Owner

jgm commented Sep 18, 2018

I'm lost. I think I misinterpreted your original issue, jumping to conclusions, so let's back up.

Your original issue is this:

[pandoc](pandoc)

gets rendered by the HTML writer with a class="uri". That's clearly undesirable. We should fix that by having the HTML writer check not just for identity between the link text and the URL, but that the URL is an absolute URL.

Is there anything else you think should be changed, or will that do it?

@cjerdonek
Copy link
Author

There are two issues I mentioned in my first post. The first (for the markdown -> html case) is that for examples like this--

[https://pandoc.org](https://pandoc.org)

[pandoc](pandoc)

I don't think class="uri" should be present for either (because neither was originally written in the <https://pandoc.org> form).

The second (for the html -> markdown case) is that if class="uri" isn't present in the HTML, then I don't think it should be auto-linked in the Markdown. But in my previous comments I was saying I'd be okay if it auto-linked opportunistically by default but later on there could be an option to disable that behavior when writing Markdown.

Both of the above issues could be addressed by reading faithfully so the AST is always an accurate representation, and confining the special logic of checking the link text and url to only the Markdown writer.

@jgm
Copy link
Owner

jgm commented Sep 18, 2018

OK. Well, if we keep the #1501 idea of having class="uri" so that autolinks can be specially styled, then we face a decision about how the HTML writer is supposed to tell whether a Link element in the AST is to be rendered as an autolink. (Note that there is no separate Autolink constructor in the AST.) There are two possible approaches:

  1. It's an autolink if the link text matches the URL and it's an absolute URL. (This is our current approach, and it's done in the writers.) With this approach, [https://pandoc.org](https://pandoc.org) will be an autolink. And you are suggesting that autolinks should have a class when converted to html #1501 be reversed, it will therefore have the uri class. [pandoc](pandoc) should not be an autolink; as mentioned, that's just a bug; the writer should test for an absolute url.

  2. It's an autolink if it the link text matches the URL, it's an absolute URL, and the link has the "uri" class. With this approach, we'd have to modify the Markdown reader to add the uri class to a Link when it parses an autolink. On this approach, [https://pandoc.org](https://pandoc.org) would not be an autolink. But we'd then have to decide how the HTML (and other) readers should behave. Presumably we should render HTML links as autolinks whenever possible, and this means adding the uri class to Links when their link text matches the URL and it's an absolute URL. Because of this, [https://pandoc.org](https://pandoc.org) would turn into an autolink if you converted markdown -> HTML, then HTML -> markdown.

I hope that makes it clear what the options are.

Of course, you might have in mind modifying the AST types so that Autolink is a first-class constructor (separate from Link). That would be a much bigger change, for dubious benefits.

@cjerdonek
Copy link
Author

cjerdonek commented Sep 19, 2018

There is a third and I think simpler option, which is for the HTML writer to use only the presence of the "uri" class to identify autolinks. With this approach, the Markdown reader would add the uri class when it parses a link formatted as an autolink, and the HTML writer wouldn't need to do anything -- just pass the class attribute through as is. It wouldn't be necessary for the writer to check the link text or URL since the Markdown reader would only be adding the "uri" class for actual (valid) autolinks.

I'm not suggesting that the suggestion behind #1501 be reversed -- just that it be implemented differently. I take the #1501 suggestion ("autolinks should have a class when converted to html") to mean that Markdown of the form <https://pandoc.org> get the class added in HTML (because it is written as an autolink), and not Markdown like [https://pandoc.org](https://pandoc.org) (because it is not written as an autolink). In other words, the information of the styling should be preserved as opposed to being added whenever possible.

I think the idea of rendering HTML links as autolinks whenever possible shouldn't be the default behavior because doing so loses whether the link was originally styled as an autolink.

@jgm
Copy link
Owner

jgm commented Sep 19, 2018

There is a third and I think simpler option, which is for the HTML writer to use only the presence of the "uri" class to identify autolinks.

I don't understand. The HTML writer doesn't need to identify autolinks, since there's no distinction in HTML syntax between autolinks and other links. The issue I raised under (2) above is not with the HTML writer, it's with the HTML reader. Maybe you mean the reader?

I think the idea of rendering HTML links as autolinks whenever possible shouldn't be the default behavior because doing so loses whether the link was originally styled as an autolink.

This concerns the HTML reader, presumably? If all HTML were generated by pandoc, it would make sense for the HTML reader never to add the "uri" class -- just to pass it through if it's present. But most HTML has not been generated by pandoc. For example, an autolink on GitHub will be rendered in HTML as

<a href=http://example.com rel="nofollow">http://example.com</a>

So on your proposal, converting this HTML to markdown would give you the unidiomatic

[http://example.com](http://example.com)

@cjerdonek
Copy link
Author

Sorry this has been hard to communicate about. Perhaps it's because we're coming at this from different preconceptions / ways of thinking about it.

I don't understand. The HTML writer doesn't need to identify autolinks, since there's no distinction in HTML syntax between autolinks and other links.

My last comment was focused primarily on the HTML writer (AST -> HTML) step (because that's what I thought your comment was about), and in particular whether the HTML writer should ever add the "uri" class (Pandoc's way of indicating an autolink in HTML). I was suggesting that the HTML writer never add the uri class if not already present in the AST, but simply pass it through as is. That way an HTML link would only have the uri class going from Markdown to HTML if the Markdown reader added it (i.e. if it had the form of an autolink <https://pandoc.org> in the Markdown).

For the other direction (HTML -> AST -> Markdown), my proposal was that the HTML reader (HTML -> AST) pass the uri class as is. And yes, this means that HTML "in the wild" would normally not have the uri class after reading to the AST. For the Markdown writer (AST -> Markdown), I was suggesting that it could render links as autolinks opportunistically by default (i.e. whenever possible), which would give you the idiomatic Markdown that you prefer. Perhaps the difference in what I'm suggesting is that the special logic checking that the link text matches the URL would be confined to the Markdown writer, and not in any of the readers.

One advantage of my proposal is that all of the autolink logic would be restricted to the Markdown reader (adding the uri class depending on how it's formatted in the Markdown) and writer (applying the link-text heuristic). All of the other readers and writers would simply pass things through as is.

Another aspect of my proposal (of lesser importance, perhaps down the road) was to allow disabling the Markdown writer's opportunistic autolinking by disabling an autolink_links extension (enabled by default). In the case of disabling, the Markdown writer would only format a link as an autolink if the uri class was already present in the AST. This would give users finer-grained control over what links should be formatted as autolinks when writing Markdown, as well as support more faithful round-tripping when going from Markdown to HTML and back.

@jgm
Copy link
Owner

jgm commented Sep 19, 2018

I was suggesting that the HTML writer never add the uri class if not already present in the AST, but simply pass it through as is. That way an HTML link would only have the uri class going from Markdown to HTML if the Markdown reader added it (i.e. if it had the form of an autolink https://pandoc.org in the Markdown).

Okay. That means the Markdown reader needs to add this class for autolinks (currently it is added by the HTML writer).

And yes, this means that HTML "in the wild" would normally not have the uri class after reading to the AST. For the Markdown writer (AST -> Markdown), I was suggesting that it could render links as autolinks opportunistically by default (i.e. whenever possible), which would give you the idiomatic Markdown that you prefer.

So on your proposal, for HTML -> Markdown, these would both be rendered as autolinks:

<a href="http://example.com">http://example.com</a>
<a href="http://example.com" class="uri">http://example.com</a>

What about something like

<a href="http://example.com" class="otherclass" id="foo">http://example.com</a>

Presumably that shouldn't be rendered as an autolink, as this would lose information. So the Markdown writer's test could be: (a) link text matches URL, (b) URL is absolute, (c) no attribute besides possibly a "uri" class.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants