build: common intermediate JSON for all pandoc outputs #196
Exporting to an intermediate JSON breaks table & equation anchors

Our project uses pandoc to export to multiple formats (HTML, PDF, DOCX, and in the future JATS). Here we investigate using a unified pandoc command to convert to Pandoc's Abstract Syntax Tree (i.e

For some reason, 896d2ca breaks the anchors for equations and tables, but not figures. As seen in the build log, WeasyPrint complains about the broken anchors:
In the HTML output prior to this pull request:

<span>Table 1:</span> A table with a top caption and specified relative column widths. <a title="Link to this part of the document" class="icon_button anchor" data-ignore="true" href="#tbl:bowling-scores">

In the HTML output when processing through an intermediate JSON AST:

Table 1: A table with a top caption and specified relative column widths. <a title="Link to this part of the document" class="icon_button anchor" data-ignore="true" href="#tables">

Note that

@jgm / @vincerubinetti any idea what could be breaking? Does the AST strip out equation and table identifiers?
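One way to check whether identifiers survive in the intermediate file is to list every identifier stored in the JSON AST and look for the expected ones, such as `tbl:bowling-scores`. The sketch below is only a diagnostic aid, not something from this pull request: the function names are hypothetical, and it relies on the assumption that any JSON list shaped like `[identifier, [classes], [[key, value], ...]]` is a pandoc `Attr`, which is a heuristic rather than a full parse of the AST.

```python
import json


def is_attr(node):
    """Heuristic: a pandoc Attr serializes as [identifier, [classes], [[key, value], ...]]."""
    return (
        isinstance(node, list)
        and len(node) == 3
        and isinstance(node[0], str)
        and isinstance(node[1], list)
        and all(isinstance(cls, str) for cls in node[1])
        and isinstance(node[2], list)
        and all(
            isinstance(pair, list)
            and len(pair) == 2
            and all(isinstance(item, str) for item in pair)
            for pair in node[2]
        )
    )


def collect_identifiers(node, found=None):
    """Walk the JSON tree and return every non-empty Attr identifier, in document order."""
    if found is None:
        found = []
    if is_attr(node):
        if node[0]:
            found.append(node[0])
        return found
    if isinstance(node, dict):
        for value in node.values():
            collect_identifiers(value, found)
    elif isinstance(node, list):
        for item in node:
            collect_identifiers(item, found)
    return found


def identifiers_in_file(path):
    """Load a pandoc JSON file (for example manuscript.json) and list its identifiers."""
    with open(path, encoding="utf-8") as handle:
        return collect_identifiers(json.load(handle))
```

If `identifiers_in_file("manuscript.json")` lists section identifiers like `tables` but no `tbl:` or `eq:` entries, that would suggest the identifiers are missing from the AST itself rather than being dropped by the HTML writer.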
These broken table & equation (but not figure) anchors have been a difficult issue for me to diagnose. My current understanding is that pandoc does not provide a way to set the I tried using A few questions for @lierdakil:
Honestly, not sure. You might save a few CPU cycles this way, but I can't say off the top of my head how much exactly. One thing I will note is that the approach you're taking in this pull request does not make sense. If you're using Pandoc JSON anyway, it seems logical to use filters as a pipe instead of passing those as
It does. Specifically, the output is slightly adjusted when the output format is docx (aka OOXML), to apply caption style to some ad-hoc captions, and is adjusted a lot when outputting to LaTeX (or PDF via LaTeX). If you use a JSON intermediary explicitly, pandoc-crossref will not do that, instead showing "generic" behaviour, unless you also pass the output format as the first argument to

Fair warning: LaTeX output in pandoc-crossref is a bit of a different beast from the rest of those. Outputting to, say, docx and LaTeX with default options will produce results that differ in surprising ways. If you want consistency, using pandoc-crossref in "pipe mode" (i.e.
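The "pipe mode" mentioned above can be sketched as three processes chained through standard input and output. This is an illustration under assumptions, not the command elided from the comment: the function names and file names are hypothetical, and it relies on the general pandoc convention that a JSON filter receives the target format as its first command-line argument.

```python
import subprocess


def pipe_mode_commands(source, writer, output_path, filter_name="pandoc-crossref"):
    """Build the three argv lists for: markdown -> JSON -> filter -> final output."""
    return [
        # Parse the source once and emit the JSON AST on stdout.
        ["pandoc", source, "--to=json"],
        # Run the filter standalone, passing the output format as its first argument.
        [filter_name, writer],
        # Read the filtered JSON from stdin and render it with the chosen writer.
        ["pandoc", "--from=json", f"--to={writer}", f"--output={output_path}"],
    ]


def run_pipe(commands):
    """Run the commands in order, feeding each one's stdout to the next one's stdin."""
    data = None
    for argv in commands:
        result = subprocess.run(argv, input=data, stdout=subprocess.PIPE, check=True)
        data = result.stdout
    return data
```

For example, `run_pipe(pipe_mode_commands("manuscript.md", "docx", "manuscript.docx"))` would run the filter with `docx` as its argument, while passing a different value there is what decides whether format-specific or generic behaviour applies.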
Thanks @lierdakil for your explanation. It seems that it really only makes sense to consolidate the conversion pipeline for filters that are output-format agnostic. Since none of the filters we currently use are actually output-format agnostic, I will close this pull request. If we do incorporate output-format agnostic filters, such as the cite-by-id filter under development in manubot/manubot#99, we can revisit this proposal. Going forward we will keep pandoc-crossref in mind, especially if our current suite of pandoc-xnos filters starts experiencing limitations.
We use pandoc to convert from markdown to several output formats, such as HTML, DOCX, PDF, and JATS in the future. Therefore, does it make sense to streamline the pandoc pipeline to do as much of the shared conversion together by saving an intermediate Pandoc JSON abstract syntax tree?
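A minimal sketch of the proposed two-stage pipeline is shown here: one shared command writes the JSON AST, then one command per output format reads it back. The function names, file names, and the choice of writers are assumptions for illustration, and the filters used by the actual build are left out.

```python
import subprocess

# Hypothetical default path for the intermediate AST.
JSON_PATH = "manuscript.json"


def to_json_command(source, json_path=JSON_PATH):
    """Shared step: parse the markdown once and save the pandoc JSON AST."""
    return ["pandoc", "--from=markdown", "--to=json", f"--output={json_path}", source]


def from_json_command(writer, output_path, json_path=JSON_PATH):
    """Output-specific step: render the saved AST with one pandoc writer."""
    return ["pandoc", "--from=json", f"--to={writer}", f"--output={output_path}", json_path]


def build_plan(source, outputs):
    """Return the shared command followed by one command per (writer, path) pair."""
    plan = [to_json_command(source)]
    plan += [from_json_command(writer, path) for writer, path in outputs]
    return plan


def run_plan(plan):
    """Run each command in order, stopping at the first failure."""
    for argv in plan:
        subprocess.run(argv, check=True)
```

Calling `build_plan("manuscript.md", [("html", "manuscript.html"), ("docx", "manuscript.docx")])` yields three commands, which keeps the common parsing step visibly separate from the format-specific ones.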
The benefits would be more assurance that different outputs include the same processing filters. The output-specific steps would be more clearly delineated from the common processing steps. Build times may improve slightly. We will also retain the manuscript.json file, which could be helpful for debugging and converting to additional formats down the road.

Draft.
Refs jgm/pandoc#3211 (comment):
Todo: