GFM to HTML conversion adding extra paragraph markup around sub-list elements #6589

Closed
troyengel opened this issue Aug 4, 2020 · 11 comments

@troyengel

I use the latest pandoc (fetched with curl) in a CI/CD pipeline to convert Markdown (GFM) into HTML. I only edit these files once every few months, so I don't know exactly when this started happening, but my guess is between versions 2.9.2.1 and 2.10, based on the last time I edited an MD file and the regenerated HTML was fine. This report comes from editing a file and running the pipeline today, which used the 2.10.1 Debian build downloaded from GitHub Releases.

The Markdown looks like this:

## Contents

  - [Server Installation](#server-installation)
  - [Server User Setup](#server-user-setup)
      - [Disable root Login](#disable-root-login)
  - [Server Hardening](#server-hardening)
      - [fail2ban Setup](#fail2ban-setup)
  - [Apache Webserver](#apache-webserver)
      - [Apache iptables Ports](#apache-iptables-ports)
      - [Apache Default Template](#apache-default-template)
      - [Apache 80 Template](#apache-80-template)
      - [Apache 443 Template](#apache-443-template)

The HTML is generated by running the following over every file in a loop:

  pandoc -s \
    -f gfm+gfm_auto_identifiers-ascii_identifiers \
    -t html \
    --template="./style/pandoc_html.tpl" \
    --include-in-header="./style/header_include.html" \
    --include-before-body="./style/body_before.html" \
    --include-after-body="./style/body_after.html" \
    --metadata pagetitle="$_TITLE" \
    --css="${_CSS}" \
    -o "${_HTML}" "$file"

The resulting HTML has extra embedded <p> elements wrapping the sub-list items, but inconsistently: on some Markdown pages where only a top-level list exists (a TOC with no leaves) it injects <p> inside the list elements, while in this case it is "mix and match" within the list and sub-list, like so:

<h2 id="contents">Contents</h2>
<ul>
<li><a href="#server-installation">Server Installation</a></li>
<li><a href="#server-user-setup">Server User Setup</a>
<ul>
<li><a href="#disable-root-login">Disable root Login</a></li>
</ul></li>
<li><a href="#server-hardening">Server Hardening</a>
<ul>
<li><a href="#fail2ban-setup">fail2ban Setup</a></li>
</ul></li>
<li><a href="#apache-webserver">Apache Webserver</a>
<ul>
<li><p><a href="#apache-iptables-ports">Apache iptables Ports</a></p></li>
<li><p><a href="#apache-default-template">Apache Default Template</a></p></li>
<li><p><a href="#apache-80-template">Apache 80 Template</a></p></li>
<li><p><a href="#apache-443-template">Apache 443 Template</a></p></li>
</ul></li>
</ul>

As far as I can recall, the <p> elements were never injected in the older version (I would have noticed, as I now have huge extra line spacing between elements). Now all the generated TOC output is a mashup where the "sometimes" wrapping causes odd visual formatting. Once CSS is applied, the above ends up looking like this:

[Image: Screenshot at 2020-08-04 12-17-54]

The result looks semi-random (I'm sure there's a pattern hiding in there): the placement of <p> elements seems to depend on the TOC construction (how many elements and sub-list elements). Pandoc definitely did not do this before; it's something new. The last time I ran my CI/CD it used a pandoc feature, gfm+backtick_code_blocks+..., which was deprecated in the latest code (my CI/CD failed and I had to fix the script to remove it). If that helps date when it last worked correctly: that feature was still accepted at the time.

Thanks!

@jgm
Owner

jgm commented Aug 4, 2020

Pandoc switched to a new library for parsing gfm in 2.10.1.
We should be able to fix this, but for now you could try reverting to an earlier version.

@mb21
Collaborator

mb21 commented Aug 4, 2020

Might be a bug in commonmark-hs... should we migrate it there?

@jgm
Owner

jgm commented Aug 4, 2020

Well this is puzzling, because I can't reproduce it.

 % pandoc -f gfm+gfm_auto_identifiers-ascii_identifiers -t html 
## Contents

  - [Server Installation](#server-installation)
  - [Server User Setup](#server-user-setup)
      - [Disable root Login](#disable-root-login)
  - [Server Hardening](#server-hardening)
      - [fail2ban Setup](#fail2ban-setup)
  - [Apache Webserver](#apache-webserver)
      - [Apache iptables Ports](#apache-iptables-ports)
      - [Apache Default Template](#apache-default-template)
      - [Apache 80 Template](#apache-80-template)
      - [Apache 443 Template](#apache-443-template)
^D
<h2 id="contents">Contents</h2>
<ul>
<li><a href="#server-installation">Server Installation</a></li>
<li><a href="#server-user-setup">Server User Setup</a>
<ul>
<li><a href="#disable-root-login">Disable root Login</a></li>
</ul></li>
<li><a href="#server-hardening">Server Hardening</a>
<ul>
<li><a href="#fail2ban-setup">fail2ban Setup</a></li>
</ul></li>
<li><a href="#apache-webserver">Apache Webserver</a>
<ul>
<li><a href="#apache-iptables-ports">Apache iptables Ports</a></li>
<li><a href="#apache-default-template">Apache Default Template</a></li>
<li><a href="#apache-80-template">Apache 80 Template</a></li>
<li><a href="#apache-443-template">Apache 443 Template</a></li>
</ul></li>
</ul>

@troyengel
Author

While I can't recreate it using the tool above, a little testing tells me that it broke between 2.10 and 2.10.1; I was able to test 2.9.2.1 and 2.10 and receive the expected HTML sans <p> elements:

<h2 id="contents">Contents</h2>
<ul>
<li><a href="#server-installation">Server Installation</a></li>
<li><a href="#server-user-setup">Server User Setup</a>
<ul>
<li><a href="#disable-root-login">Disable root Login</a></li>
</ul></li>
<li><a href="#server-hardening">Server Hardening</a>
<ul>
<li><a href="#fail2ban-setup">fail2ban Setup</a></li>
</ul></li>
<li><a href="#apache-webserver">Apache Webserver</a>
<ul>
<li><a href="#apache-iptables-ports">Apache iptables Ports</a></li>
<li><a href="#apache-default-template">Apache Default Template</a></li>
<li><a href="#apache-80-template">Apache 80 Template</a></li>
<li><a href="#apache-443-template">Apache 443 Template</a></li>
</ul></li>
</ul>

[Image: pandoc_2 9 2 1]

When I bounce the script back up to the latest 2.10.1 we get the injection of the extra paragraphs inside lists. Verbose mode didn't reveal anything interesting, no changes other than swapping out the release version.
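
One quick way to compare the output of two versions is to count list items whose content is wrapped in a paragraph. This is only a sketch: the helper name `count_loose_items` is made up for illustration, and it relies on pandoc emitting each `<li>` on its own line, as in the HTML shown above.

```shell
# Hypothetical helper: count lines containing a list item wrapped in <p>.
# Assumes one <li> per line, as in the pandoc HTML output shown in this thread.
# Reads a single file given as an argument, or stdin when called without one.
count_loose_items() {
  # grep -c prints 0 but exits non-zero when nothing matches; "|| true" keeps
  # a clean page from failing a script that runs under "set -e".
  grep -c '<li><p>' "$@" || true
}
```

Running `count_loose_items ./public/foo.html` after each build should give 0 for a TOC written as a tight list, so any other number flags a page to inspect.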

CI/CD pipeline (.gitlab-ci.yml)

image: debian:latest

before_script:
  - bash ./bin/debian_pandoc.sh >/dev/null

test:
  stage: test
  script:
  - bash ./bin/generate_html.sh
  only:
  - branches
  - tags

pages:
  stage: deploy
  script:
  - bash ./bin/generate_html.sh
  artifacts:
    paths:
    - public
  only:
  - master

debian_pandoc.sh

export DEBCONF_NOWARNINGS="yes"
echo 'debconf debconf/frontend select Noninteractive' | debconf-set-selections
apt-get -qq update
apt-get -qq -y install curl pandoc

_VER=$(curl -s "https://api.github.com/repos/jgm/pandoc/releases/latest" | grep -Po '"tag_name": "\K.*?(?=")')

curl -sLo "pandoc-${_VER}-1-amd64.deb" "https://github.com/jgm/pandoc/releases/download/${_VER}/pandoc-${_VER}-1-amd64.deb"

apt-get -qq -y install "./pandoc-${_VER}-1-amd64.deb"
apt-get -qq -y autoremove
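
For testing a specific release, the "latest" lookup can be replaced with a pinned value. A minimal sketch, assuming the asset naming in the `curl` line above also holds for the pinned release; `PANDOC_VERSION` and `pandoc_deb_url` are names invented here, not something pandoc or GitLab provides.

```shell
# Hypothetical: pin the release instead of querying the "latest" API endpoint.
# PANDOC_VERSION is an assumed variable name; 2.10 is a release that produced
# the expected HTML in this report.
_VER="${PANDOC_VERSION:-2.10}"

# Build the download URL using the same pattern as the script above.
pandoc_deb_url() {
  printf 'https://github.com/jgm/pandoc/releases/download/%s/pandoc-%s-1-amd64.deb\n' "$1" "$1"
}

# Usage (needs network access), mirroring the original curl/apt-get lines:
#   curl -sLo "pandoc-${_VER}-1-amd64.deb" "$(pandoc_deb_url "$_VER")"
#   apt-get -qq -y install "./pandoc-${_VER}-1-amd64.deb"
```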

generate_html.sh

# gitlab-ci looks here
[[ ! -d ./public ]] && mkdir ./public

# copy our CSS - <link rel="stylesheet" href="${_CSS}" />
_CSS="mdhtml.css"
cp "./style/${_CSS}" ./public/

# icon
_FAV="favicon.ico"
cp "./style/${_FAV}" ./public/

# ./src/foo.md -> ./public/foo.html
for file in ./src/*.md; do
  _FILE="${file##*/}"
  _HTML="./public/${_FILE%.*}.html"
  echo "Processing $file to $_HTML"

  # metadata for pandoc
  _TITLE=$(grep -m1 "^# " "$file" | sed -r 's/# //')

  # [foo](foo.md) -> [foo](foo.html)
  #  sed -i -r 's/(\[.*?\])\((.*?)\.md\)/\1(\2.html)/' "$file"
  # sed does not support non-greedy (.*?) like perl, we have to hack it
  sed -i -r \
    -e ':loop' \
    -e 's/(\[.*\])\((.*)\.md\)/\1(\2.html)/g' \
    -e 't loop' "$file"

  pandoc -s \
    -f gfm+gfm_auto_identifiers-ascii_identifiers \
    -t html \
    --template="./style/pandoc_html.tpl" \
    --include-in-header="./style/header_include.html" \
    --include-before-body="./style/body_before.html" \
    --include-after-body="./style/body_after.html" \
    --metadata pagetitle="$_TITLE" \
    --css="${_CSS}" \
    -o "${_HTML}" "$file"
done

The Markdown is written in plain GFM using the "4 spaces indent" style for sub-list items. The idea is that these documents display the same in Gitlab/Github rendering as they do when generated to HTML with some CSS applied (we fixed a pandoc issue about a year ago related to these TOC entries not matching; I'm the same guy).

The template is almost the same as the default pandoc one. I had to use a custom one to override some of the CSS/HTML embedded in the internal template (it's been so long I forget what exactly; something in the header is hard-coded?), but just in case, here's the TPL file referenced above:

<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" lang="$lang$" xml:lang="$lang$"$if(dir)$ dir="$dir$"$endif$>
<head>
  <meta charset="utf-8" />
  <meta name="generator" content="pandoc" />
  <meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes" />
$for(author-meta)$
  <meta name="author" content="$author-meta$" />
$endfor$
$if(date-meta)$
  <meta name="dcterms.date" content="$date-meta$" />
$endif$
$if(keywords)$
  <meta name="keywords" content="$for(keywords)$$keywords$$sep$, $endfor$" />
$endif$
  <title>$if(title-prefix)$$title-prefix$ – $endif$$pagetitle$</title>
$if(quotes)$
  <style type="text/css">
      q { quotes: "“" "”" "‘" "’"; }
  </style>
$endif$
$if(highlighting-css)$
  <style type="text/css">
$highlighting-css$
  </style>
$endif$
$for(css)$
  <link rel="stylesheet" href="$css$" />
$endfor$
$if(math)$
  $math$
$endif$
$for(header-includes)$
  $header-includes$
$endfor$
</head>
<body>
$for(include-before)$
$include-before$
$endfor$
$if(title)$
<header id="title-block-header">
<h1 class="title">$title$</h1>
$if(subtitle)$
<p class="subtitle">$subtitle$</p>
$endif$
$for(author)$
<p class="author">$author$</p>
$endfor$
$if(date)$
<p class="date">$date$</p>
$endif$
</header>
$endif$
$if(toc)$
<nav id="$idprefix$TOC">
$table-of-contents$
</nav>
$endif$
$body$
$for(include-after)$
$include-after$
$endfor$
</body>
</html>

I create the TOCs manually so they display inside Gitlab/Github; to be clear, pandoc is not generating the TOC in this example. A different page which has no sub-lists in the TOC has its HTML looking like this:

<h2 id="contents">Contents</h2>
<ul>
<li><p><a href="#overview">Overview</a></p></li>
<li><p><a href="#process">Process</a></p></li>
</ul>

Another document has the mix-and-match going on:

<h2 id="contents">Contents</h2>
<ul>
<li><p><a href="#prerequisites">Prerequisites</a></p>
<ul>
<li><a href="#ad-setup-information">AD Setup Information</a></li>
</ul></li>
<li><p><a href="#implementation">Implementation</a></p>
<ul>
<li><a href="#install-rpms">Install RPMs</a></li>
<li><a href="#dns-configuration">DNS Configuration</a></li>
<li><a href="#configure-kerberos">Configure Kerberos</a>
<ul>
<li><a href="#get-a-kerberos-ticket">Get a Kerberos ticket</a></li>
<li><a href="#list-the-ticket-provided">List the ticket provided</a></li>
<li><a href="#destroy-the-ticket">Destroy the ticket</a></li>
</ul></li>
<li><a href="#samba-configuration">Samba Configuration</a>
<ul>
<li><a href="#join-the-domain">Join the domain</a></li>
<li><a href="#configure-winbind-authentication">Configure winbind authentication</a></li>
</ul></li>
<li><a href="#pam-configuration">PAM Configuration</a>
<ul>
<li><a href="#rhel5-and-rhel6">RHE5 and RHEL6</a></li>
<li><a href="#rhel6-only">RHEL6 Only</a></li>
</ul></li>
<li><a href="#parent-home-directory">Parent Home Directory</a></li>
</ul></li>
<li><p><a href="#testing">Testing</a></p></li>
<li><p><a href="#cached-logins">Cached Logins</a></p></li>
<li><p><a href="#user-crontabs">User crontabs</a></p></li>
<li><p><a href="#references">References</a></p></li>
</ul>

...this last one is interesting because sub-sub-lists (??) are missing it - so a sub-list without sub-sub has extra <p> but a sub-list with sub-sub-lists has no <p> added in the list elements. This one deserves another screenshot.

[Image: mixed]

Something strange is afoot at the Circle-K...

@mb21
Collaborator

mb21 commented Aug 5, 2020

Puzzling indeed, I can reproduce with the pandoc from homebrew:

pandoc 2.10.1
Compiled with pandoc-types 1.21, texmath 0.12.0.2, skylighting 0.8.5
Default user data directory: /Users/maurobieg/.local/share/pandoc or /Users/maurobieg/.pandoc
Copyright (C) 2006-2020 John MacFarlane
Web:  https://pandoc.org
This is free software; see the source for copying conditions.
There is no warranty, not even for merchantability or fitness
for a particular purpose.
~ pandoc -f gfm
## Contents

  - [Server Installation](#server-installation)
  - [Server User Setup](#server-user-setup)
      - [Disable root Login](#disable-root-login)
  - [Server Hardening](#server-hardening)
      - [fail2ban Setup](#fail2ban-setup)
  - [Apache Webserver](#apache-webserver)
      - [Apache iptables Ports](#apache-iptables-ports)
      - [Apache Default Template](#apache-default-template)
      - [Apache 80 Template](#apache-80-template)
      - [Apache 443 Template](#apache-443-template)







<h2 id="contents">Contents</h2>
<ul>
<li><a href="#server-installation">Server Installation</a></li>
<li><a href="#server-user-setup">Server User Setup</a>
<ul>
<li><a href="#disable-root-login">Disable root Login</a></li>
</ul></li>
<li><a href="#server-hardening">Server Hardening</a>
<ul>
<li><a href="#fail2ban-setup">fail2ban Setup</a></li>
</ul></li>
<li><a href="#apache-webserver">Apache Webserver</a>
<ul>
<li><p><a href="#apache-iptables-ports">Apache iptables Ports</a></p></li>
<li><p><a href="#apache-default-template">Apache Default Template</a></p></li>
<li><p><a href="#apache-80-template">Apache 80 Template</a></p></li>
<li><p><a href="#apache-443-template">Apache 443 Template</a></p></li>
</ul></li>
</ul>

I haven't master on this machine though... and are we sure try.pandoc is running 2.10.1 ?

@jgm
Owner

jgm commented Aug 5, 2020

> are we sure try.pandoc is running 2.10.1 ?

It puts the version (generated by the library) on the bottom of the page when you convert -- so yes.
Maybe the homebrew pandoc is not right?

@jgm
Owner

jgm commented Aug 5, 2020

Can you put something into your pipeline that runs pandoc --version so we can confirm that the correct version is being run?
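
A sketch of what that could look like in `generate_html.sh`. The helper name `pandoc_version_from` is invented for illustration; it assumes the first line of `pandoc --version` has the form `pandoc 2.10.1`, as in the output mb21 pasted above.

```shell
# Hypothetical helper: extract the version number from `pandoc --version`
# output supplied on stdin. Assumes the first line looks like "pandoc 2.10.1".
pandoc_version_from() {
  awk 'NR == 1 { print $2 }'
}

# In the pipeline, log it before converting anything:
#   echo "Using pandoc $(pandoc --version | pandoc_version_from)"
```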

@jgm
Owner

jgm commented Aug 5, 2020

And, I can reproduce this with the commonmark cli tool from commonmark-hs.
So, yes, this is an issue in commonmark-hs.
I'll open a new issue there.
jgm/commonmark-hs#56

EDIT: you can work around this in your pipeline by stripping excess blank lines before passing to pandoc.
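
A rough sketch of that workaround, assuming the trigger is a run of blank lines like the one at the end of mb21's input above. The function name `squeeze_blank_lines` is made up here; it collapses each run of blank lines to one and drops leading and trailing blank lines.

```shell
# Hypothetical helper: collapse runs of blank lines to a single blank line
# and drop blank lines at the start and end of the input.
squeeze_blank_lines() {
  awk '
    /^[[:space:]]*$/ { pending = 1; next }   # remember a blank run, print nothing yet
    {
      if (pending && seen) print ""          # one blank line between content blocks
      pending = 0; seen = 1
      print
    }
  ' "$@"
}
```

It could be used as `squeeze_blank_lines "$file" | pandoc -f gfm -t html`. Note that it also squeezes blank runs inside code blocks, so it is only suitable where that is acceptable.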

jgm closed this as completed in 57417fe Aug 5, 2020

@troyengel
Author

Excellent, thank you for diagnosing and fixing so quickly, much appreciated.
