GFM to HTML creates incompatible anchors vs. Github/Gitlab anchor logic #5057

Closed
troyengel opened this issue Nov 11, 2018 · 10 comments

@troyengel

Pandoc: 2.4-1 / Debian 9 (github DEB download)

Use: pandoc -s -f gfm+backtick_code_blocks -t html -o file.html file.md

The logic Pandoc uses to generate anchors doesn't match the logic GitHub/GitLab use when rendering. After searching around, I think this is the routine they use; its list of STRIPPED chars seems to match what I am seeing:

In general, Pandoc allows more punctuation (slashes, parens, periods, etc.) in the generated anchor than GitHub/GitLab do. Here are some headings from my Markdown whose anchors Pandoc generates differently:

## Contents

  - [net.ifnames Naming](#netifnames-naming)
  - [/etc/hostname](#etchostname)
  - [Instanced Units (.service, .socket, etc...)](#instanced-units-service-socket-etc)
  - [Mount Units (.mount)](#mount-units-mount)
  - [Example Bind Mount - /var/tmp to /tmp](#example-bind-mount-vartmp-to-tmp)

## net.ifnames Naming

 - pandoc: `net.ifnames-naming`
 - gitlab / github: `netifnames-naming`

## /etc/hostname

 - pandoc: `/etc/hostname`
 - gitlab / github: `etchostname`

## Instanced Units (.service, .socket, etc...)

 - pandoc: `instanced-units-(.service,-.socket,-etc...)`
 - gitlab / github: `instanced-units-service-socket-etc`

## Mount Units (.mount)

 - pandoc: `mount-units-(.mount)`
 - gitlab / github: `mount-units-mount`

## Example Bind Mount - /var/tmp to /tmp

 - pandoc: `example-bind-mount---/var/tmp-to-/tmp`
 - gitlab / github: `example-bind-mount---vartmp-to-tmp`

Gists of each platform's rendering show their anchor generation; it matches what you see rendered when the MD file is saved into the repository view (same rendering engine):

The functional problem is that manually maintained TOC lists which work correctly when using the Markdown files linked directly from Github/Gitlab do not work when Pandoc processes them to create HTML Pages out of the content. With this kind of technical writing it's hard to not use these kinds of markup in Heading elements here and there, especially when referring to filenames or keywords which can't be reworded. Thanks!

Related issues I found: #2821 #3388

@jgm
Owner

jgm commented Nov 11, 2018

GitHub doesn't use redcarpet any more for rendering. They use a variant of cmark.
It may be that they still use this list of characters, however.

@troyengel
Author

Yeah, I wasn't 100% sure. Chasing this down led to several dead ends, and I can't put my finger on where the exact routine lives. The only reason it "felt right" was that the list of stripped characters matched what I was seeing. (I'm not having any luck finding the right source code with Google.)

@jgm
Owner

jgm commented Nov 11, 2018

I notice that the gfm_auto_identifiers extension is not documented in MANUAL.txt. It should be (including the algorithm).

Here's the relevant function (toIdent from Text.Pandoc.Readers.CommonMark):

toIdent :: ReaderOptions -> [Inline] -> String
toIdent opts = map (\c -> if isSpace c then '-' else c)
               . filterer
               . map toLower . stringify
  where filterer = if isEnabled Ext_ascii_identifiers opts
                   then mapMaybe toAsciiChar
                   else filter (\c -> isLetter c || isAlphaNum c || isSpace c ||
                                      c == '_' || c == '-')

@jgm
Owner

jgm commented Nov 11, 2018

@kivikakk might be able to help us locate the exact algorithm GitHub uses to create the automatic header identifiers, so we can match it better.

@jgm
Owner

jgm commented Nov 11, 2018

Ah, it looks like the ascii_identifiers extension (which is enabled by default for gfm) is interfering:

% pandoc -f gfm+gfm_auto_identifiers
## Mount Units (.mount)
<h2 id="mount-units-(.mount)">Mount Units (.mount)</h2>

% pandoc -f gfm+gfm_auto_identifiers-ascii_identifiers
## Mount Units (.mount)
<h2 id="mount-units-mount">Mount Units (.mount)</h2>

This should be trivial to fix.
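To illustrate the interaction, here is a minimal, self-contained sketch. It is not the change that landed in pandoc. `toAsciiSketch` is a hypothetical stand-in for pandoc's `toAsciiChar` that assumes ASCII characters (punctuation included) pass through unchanged, which is consistent with the output shown above. `identEitherOr` mirrors the shape of the quoted `toIdent`, where the ASCII step replaces the punctuation filter. `identBoth` is one possible arrangement that always filters first; all function names are invented for this example.

```haskell
import Data.Char (isAlphaNum, isAscii, isSpace, toLower)
import Data.Maybe (mapMaybe)

-- Hypothetical stand-in for pandoc's toAsciiChar (an assumption):
-- ASCII characters pass through, everything else is dropped.
toAsciiSketch :: Char -> Maybe Char
toAsciiSketch c
  | isAscii c = Just c
  | otherwise = Nothing

-- The punctuation filter from the quoted toIdent.
gfmFilter :: String -> String
gfmFilter = filter (\c -> isAlphaNum c || isSpace c || c == '_' || c == '-')

spacesToHyphens :: String -> String
spacesToHyphens = map (\c -> if isSpace c then '-' else c)

-- Same shape as the quoted code: with the ASCII flag on, the
-- punctuation filter is skipped entirely.
identEitherOr :: Bool -> String -> String
identEitherOr asciiOnly = spacesToHyphens . filterer . map toLower
  where
    filterer = if asciiOnly then mapMaybe toAsciiSketch else gfmFilter

-- One possible arrangement: always filter, then optionally restrict to ASCII.
identBoth :: Bool -> String -> String
identBoth asciiOnly = spacesToHyphens . asciiStep . gfmFilter . map toLower
  where
    asciiStep = if asciiOnly then mapMaybe toAsciiSketch else id
```

Loaded in GHCi, `identEitherOr True "Mount Units (.mount)"` gives `"mount-units-(.mount)"`, while `identEitherOr False` and `identBoth True` both give `"mount-units-mount"`.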

@troyengel
Author

Roger that! I manage this via CI/CD, testing a pipeline build now to verify.... insert hold music

@troyengel
Author

troyengel commented Nov 11, 2018

Every example above works great (the actual in-place content as well as a bunch more), 100% fixes everything right up when adding +gfm_auto_identifiers-ascii_identifiers to work around it. Thank you! :)

@jgm
Owner

jgm commented Nov 11, 2018

Great, I'm going to fix the code so that ascii_identifiers works better with gfm_auto_identifiers. (While I'm at it, I'll make gfm_auto_identifiers work also with other formats, and I'll document it.)

jgm added a commit that referenced this issue Nov 11, 2018
This partially addresses #5057, fixing a bad interaction between
the `ascii_identifiers` extension and the `gfm_auto_identifiers`
extension, and creating identifiers that match the ones GitHub
produces.

This code still needs to be put somewhere common, so the
`gfm_auto_identifiers` extension will work with other formats.
@jgm jgm closed this as completed in a36d202 Nov 11, 2018
@kivikakk

> @kivikakk might be able to help us locate the exact algorithm GitHub uses to create the automatic header identifiers, so we can match it better.

Happy to help. We:

  • take the textual content of the heading (essentially the innerText of the DOM node)
  • convert it to lowercase in a Unicode-aware manner
  • remove all characters except for hyphen, space, and members of the Unicode general categories Letter, Mark, Number, and Connector_Punctuation
  • convert all spaces to hyphens

That leaves us with the ID. If we've already created a heading with an identical ID, we append -1 to it. If the ID suffixed with -1 has been taken, we try -2, and so on.
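The steps above can be sketched in Haskell as follows. This is an illustration under stated assumptions, not GitHub's or pandoc's actual code. The names `keepChar`, `slug`, `uniqueIds` and `headingIds` are invented here, "space" is taken to mean only U+0020, and `Data.Char.toLower` (a per-character mapping) stands in for full Unicode-aware lowercasing. The input is assumed to already be the heading's plain text.

```haskell
import Data.Char (GeneralCategory (..), generalCategory, toLower)
import Data.List (mapAccumL)

-- Keep hyphen, space, and the Letter, Mark, Number and
-- Connector_Punctuation general categories; drop everything else.
keepChar :: Char -> Bool
keepChar c
  | c == '-' || c == ' ' = True
  | otherwise = case generalCategory c of
      UppercaseLetter      -> True
      LowercaseLetter      -> True
      TitlecaseLetter      -> True
      ModifierLetter       -> True
      OtherLetter          -> True
      NonSpacingMark       -> True
      SpacingCombiningMark -> True
      EnclosingMark        -> True
      DecimalNumber        -> True
      LetterNumber         -> True
      OtherNumber          -> True
      ConnectorPunctuation -> True
      _                    -> False

-- Lowercase, strip disallowed characters, then turn spaces into hyphens.
slug :: String -> String
slug = map spaceToHyphen . filter keepChar . map toLower
  where
    spaceToHyphen c = if c == ' ' then '-' else c

-- Make IDs unique in document order: on a clash try -1, then -2, and so on.
uniqueIds :: [String] -> [String]
uniqueIds = snd . mapAccumL step []
  where
    step used base =
      let chosen = firstFree used base
       in (chosen : used, chosen)
    firstFree used base = go (base : [base ++ "-" ++ show n | n <- [1 :: Int ..]])
      where
        go (c : cs)
          | c `elem` used = go cs
          | otherwise = c
        go [] = base -- unreachable: the candidate list is infinite

-- IDs for a document's headings, in order.
headingIds :: [String] -> [String]
headingIds = uniqueIds . map slug
```

On the headings from the original report, `slug "Example Bind Mount - /var/tmp to /tmp"` evaluates to `"example-bind-mount---vartmp-to-tmp"`, and a heading that appears twice gets the `-1` suffix on its second occurrence.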

@jgm
Owner

jgm commented Nov 12, 2018

@kivikakk thanks, we were close but not exactly there. This helps!
