openxml raw inline/block without parsing? #6933

Closed
jjallaire opened this issue Dec 7, 2020 · 15 comments

@jjallaire

I've noticed that when creating a RawInline or RawBlock of type "openxml", the XML is actually parsed into a valid XML fragment (closing tags as necessary). Looks like that is happening here:

blockToOpenXML' _ b@(RawBlock format str)
  | format == Format "openxml" = return [ x | Elem x <- parseXML str ]
  | otherwise = do
      report $ BlockNotRendered b
      return []

I'm wondering if this is a hard requirement or if an option to avoid this could be provided? The use case is constructing more elaborate structures (e.g. tables) where we embed standard markdown tokens inside raw structures. This is done, for example, in the pandoc-crossref filter to provide LaTeX figure/subfigure layout (where the raw tokens provide the figure/subfigure LaTeX structure and the markdown tokens are then used to actually render the figures).

I was hoping to create a similar feature that enabled grid-based figure layouts for docx, but am stuck on this constraint, as e.g. this token:

pandoc.RawInline("openxml", "<w:tbl>")

Ends up in the document like this:

<w:tbl></w:tbl>

The issue w/ just using a pandoc.Table is that it doesn't appear as if you can get precise percentage-based layout for markdown tables emitted to docx, nor can you (currently) set cell-based alignment. So I don't think you can do something like this: http://lierdakil.github.io/pandoc-crossref/#subfigure-grid

I may be missing something here but just wanted to record the constraints I'm seeing and hoping there is a way around them either w/ an existing behavior or perhaps a new one.

@jgm
Owner

jgm commented Dec 8, 2020

The problem is that the writer is building up a data structure that represents an XML document tree. We can't just insert a string into this -- only a node of an XML document -- an element, or some plain text.
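
The constraint described here can be illustrated outside pandoc. The following is a minimal Python sketch, not pandoc's code: it uses the standard library's `xml.etree.ElementTree`, and the helper name `parse_fragment` is made up for the illustration. A document tree only accepts complete element nodes, so a dangling `<w:tbl>` has to be either rejected or repaired. Python's strict parser rejects it, whereas the parser used by the docx writer repairs it by closing the tag, as reported at the top of the thread.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, bound to the conventional "w" prefix.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def parse_fragment(fragment):
    """Parse an OpenXML fragment into a list of complete element nodes.

    The fragment is wrapped in a throwaway root that declares the "w"
    prefix, because ElementTree cannot parse an unbound prefix.
    """
    wrapped = f'<root xmlns:w="{W_NS}">{fragment}</root>'
    return list(ET.fromstring(wrapped))


# A tree is built from nodes, so a complete fragment can be attached.
body = ET.Element(f"{{{W_NS}}}body")
body.extend(parse_fragment("<w:p><w:r><w:t>ok</w:t></w:r></w:p>"))

# A dangling opening tag is not a node; a strict parser refuses it.
try:
    parse_fragment("<w:tbl>")
    dangling_accepted = True
except ET.ParseError:
    dangling_accepted = False
```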

> it doesn't appear as if you can get precise percentage-based layout for markdown tables emitted to docx

To the contrary, the docx writer does take into account width information on table cells, emitting a w:w attribute (width) inside w:gridCol. It's possible that there's some issue with this, but if so we should address it.
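
As a rough illustration of what width information in `w:gridCol` looks like, here is a Python sketch that turns relative column widths into a `w:tblGrid`. It is not the docx writer's implementation: the function name `grid_col_xml` and the default text width of 7920 twentieths of a point (5.5 inches) are assumptions chosen for the example.

```python
def grid_col_xml(fractions, total_twips=7920):
    """Render a w:tblGrid from relative column widths.

    `fractions` are relative widths (ideally summing to 1.0).
    `total_twips` is an assumed text width in twentieths of a point.
    """
    cols = "".join(
        f'<w:gridCol w:w="{round(fraction * total_twips)}"/>'
        for fraction in fractions
    )
    return f"<w:tblGrid>{cols}</w:tblGrid>"


two_equal_columns = grid_col_xml([0.5, 0.5])
```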

@jjallaire
Author

Okay, I thought it might be something like that so no real way around that constraint.

I'll do some additional experimentation with the table columns. One issue may be that, while the columns are sized correctly, if <w:tblLayout w:type="fixed"/> isn't used then Word is free to re-adjust them based on their content. The trick may be to size both the columns and their contents. I'll dig further into using markdown tables and report back on what I find.

@jjallaire
Author

Just recalled that another issue I was trying to solve for in emitting raw openxml was a table which has distinct column sizing per-row (for more sophisticated figure layouts). With the new table AST this will be possible but until that's supported in the docx writer the best solution might be to use native pandoc column sizing and create a separate table per row of figures. Again, will report back after further experimentation.

@jjallaire
Author

It turns out that using a combination of column widths and underlying figure widths you can indeed do arbitrary horizontal layout w/ in-cell alignment. The key is to distribute the column widths evenly so that they add up to 1.0 (so that the full horizontal width of the page is occupied) and then to explicitly size the figures using physical units (inches). Word ends up resizing the columns, but it does so w/r/t the content widths, so it comes out the way you want.
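
The arithmetic behind this layout can be sketched in a few lines of Python. Everything named here is hypothetical: `panel_layout`, the 6.5 inch usable page width, and the 0.1 inch allowance per cell are assumptions for the illustration, not values taken from the thread.

```python
def panel_layout(n_figures, page_width_in=6.5, gutter_in=0.1):
    """Return (column widths, figure widths) for one row of figures.

    Column widths are equal fractions that sum to 1.0, so the table
    spans the full text width. Figure widths are physical sizes in
    inches, slightly narrower than their cell (assumed gutter).
    """
    col_widths = [1.0 / n_figures] * n_figures
    fig_width = page_width_in / n_figures - gutter_in
    fig_widths = [f"{fig_width:.2f}in"] * n_figures
    return col_widths, fig_widths


cols, figs = panel_layout(2)
```

The strings in `figs` are in the form that could be used as a `width` attribute on each image, while `cols` would be the relative widths of the table columns.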

Here's an example of what a so composed figure panel looks like rendered in docx:

[Screenshot: figure panel rendered in docx]

A couple of related requests (LMK if I should open separate issues for these):

  1. When I add multiple tables to the AST we end up with an extra (empty) <w:p> tag between the tables (you can see this extra vertical space in the image above). Is there another thing I could insert explicitly that would close the vertical gap?

  2. The caption for the entire panel is created w/ the following openxml:

<w:p>
  <w:pPr>
    <w:jc w:val="center"/>
    <w:pStyle w:val="ImageCaption"/>
  </w:pPr>
  <w:r>
    <w:t xml:space="preserve">Figure 1: Full Caption</w:t>
  </w:r>
</w:p>

Unfortunately the actual caption text had to be inserted using pandoc.utils.stringify (to create a fully valid XML node, as per the above discussion), so we lose any markdown formatting it had. I understand that we can't compose XML in fragments, but perhaps we could go the other way and call a Lua function to render the caption into an arbitrary raw format, e.g. pandoc.utils.render("openxml", captionEl).

It's not a huge problem to lose the markdown in the caption as it's probably somewhat rare (although for some disciplines it seems like math would be a frequent requirement).
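
To make the stringify-based workaround concrete, here is a small Python sketch that builds the caption paragraph from plain text. It mirrors the XML shown above; the function name `caption_openxml` is made up, and the input is assumed to be text that has already been flattened (which is exactly where the markdown formatting is lost). The one step that is easy to forget is escaping, since the text is spliced into XML.

```python
from xml.sax.saxutils import escape


def caption_openxml(text, style="ImageCaption"):
    """Build a centered caption paragraph from already-flattened text.

    `text` must be plain text: any inline formatting has been dropped
    before this point. Special characters are XML-escaped.
    """
    return (
        "<w:p>"
        "<w:pPr>"
        '<w:jc w:val="center"/>'
        f'<w:pStyle w:val="{style}"/>'
        "</w:pPr>"
        "<w:r>"
        f'<w:t xml:space="preserve">{escape(text)}</w:t>'
        "</w:r>"
        "</w:p>"
    )


caption = caption_openxml("Figure 1: A & B, where a < b")
```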

@jjallaire
Author

For (1) above, inserting an empty "openxml" RawBlock between the tables seems to be enough to prevent the automatic insertion of an empty paragraph.

@jgm
Owner

jgm commented Dec 8, 2020

> When I add multiple tables to the AST we end up with an extra (empty) <w:p> tag between the tables (you can see this extra vertical space in the image above).

See #4315 and commit 93e3d46 for the motivation for this. Your workaround seems good.

I don't currently see a good way around the other issue. In principle we could add Lua support for rendering -- I think there may be an issue for this already -- but we don't even expose an openxml writer currently.

@jjallaire
Author

My initial workaround ended up with the combined tables noted in #4315. Here is where I landed (a zero-height text frame):

<w:p>
  <w:pPr>
    <w:framePr w:w="0" w:h="0" w:vAnchor="margin" w:hAnchor="margin" w:xAlign="right" w:yAlign="top"/>
  </w:pPr>
</w:p>
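
The placement of such a separator can be sketched as a small list transformation. This Python example is only an illustration: blocks are modelled as plain tuples rather than pandoc's AST, and `separate_tables` is a hypothetical helper. It inserts the zero-height frame paragraph wherever two tables are adjacent.

```python
# The zero-height text frame from the comment above, as one string.
FRAME_SEPARATOR = (
    "<w:p><w:pPr>"
    '<w:framePr w:w="0" w:h="0" w:vAnchor="margin" w:hAnchor="margin" '
    'w:xAlign="right" w:yAlign="top"/>'
    "</w:pPr></w:p>"
)


def separate_tables(blocks):
    """Insert a raw openxml separator between adjacent tables.

    Each block is a tuple whose first item is its kind, a stand-in
    for real AST nodes.
    """
    result = []
    for block in blocks:
        if result and result[-1][0] == "Table" and block[0] == "Table":
            result.append(("RawBlock", "openxml", FRAME_SEPARATOR))
        result.append(block)
    return result


blocks = [("Table", "row 1"), ("Table", "row 2"), ("Para", "text"), ("Table", "row 3")]
separated = separate_tables(blocks)
```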

@tarleb
Collaborator

tarleb commented Dec 9, 2020

I had a look at the Docx writer and believe that the initial request, i.e., adding raw openxml blocks without parsing, would be feasible with moderate effort.

The writer currently uses lists of Elements as building blocks. I believe that it wouldn't be too hard to generalize this and use Content instead. This would then allow packing the raw blocks into raw CData elements.
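
The idea of generalizing from elements to content nodes can be shown with a toy model. This Python sketch is not the Haskell implementation and does not use pandoc's XML types: the classes `Elem`, `Text`, and `Raw` and the `render` function are invented for the illustration. The point is that once the tree may hold a raw node that is emitted verbatim, an opening and a closing tag can be supplied as two separate raw pieces around normally rendered content.

```python
from dataclasses import dataclass, field
from typing import List, Union
from xml.sax.saxutils import escape


@dataclass
class Text:
    value: str  # character data, escaped on output


@dataclass
class Raw:
    value: str  # emitted verbatim, never parsed or escaped


@dataclass
class Elem:
    name: str
    children: List["Content"] = field(default_factory=list)


Content = Union[Elem, Text, Raw]


def render(node):
    """Serialize a content node; Raw nodes pass through untouched."""
    if isinstance(node, Text):
        return escape(node.value)
    if isinstance(node, Raw):
        return node.value
    inner = "".join(render(child) for child in node.children)
    return f"<{node.name}>{inner}</{node.name}>"


# Raw opening and closing tags wrapped around a regular element.
body = Elem("w:body", [Raw("<w:tbl>"), Elem("w:tr", [Text("a < b")]), Raw("</w:tbl>")])
rendered = render(body)
```

The trade-off is that the tree no longer guarantees well-formed output: whoever supplies the raw pieces is responsible for making them balance.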

Is this worthwhile, and should the issue be reopened?

@jjallaire
Author

In my view it's extremely powerful to be able to intermix raw markup with pandoc tokens. This is used to great effect in pandoc-crossref figure layout, where arbitrary raw tex composes a structure (a subfigure grid) but then allows Pandoc to render the actual figures. The alternative if this weren't possible would be to emit the figures using additional raw markup, but then they are essentially lost from the AST for downstream processing by other filters (and we lose whatever other desirable native behaviors pandoc has). This also becomes relevant for captions, as you really want to allow markup in captions (again, emitting the caption entirely using raw markup requires pandoc.utils.stringify).

In LaTeX or HTML it's straightforward enough to emit raw markup for figures, so if you are willing to accept the tradeoffs of erasing the figure from the AST and not supporting markup in the caption you can at least do it. For docx though, emitting figures is more complex (they need to be properly embedded in the zip file) so we really need pandoc to do this processing from the AST.

You can imagine other scenarios where emitting raw openxml would be desirable: for example, in a PowerPoint writer you might want to emit multiple "frames" of content on a slide. If we could emit partial XML structures then this would be possible for a filter (it could emit the begin and end frame xml literally, and let pandoc fill in the middle with standard markup processing).

tarleb reopened this Dec 9, 2020
@tarleb
Collaborator

tarleb commented Dec 9, 2020

Reopened. I may be able to give it a try later this week.

> For docx though, emitting figures is more complex (they need to be properly embedded in the zip file) so we really need pandoc to do this processing from the AST.

Lua filters have access to pandoc's "mediabag", so I believe it might be (well, become) possible to handle that in a filter.

@jjallaire
Author

If you do give it a try then LMK and I'll test immediately with our use case.

> Lua filters have access to pandoc's "mediabag", so I believe it might be (well, become) possible to handle that in a filter.

That's true, but the Pandoc code required to emit docx images is from what I can see quite a bit more involved than for LaTeX or HTML:

inlineToOpenXML' opts (Image attr@(imgident, _, _) alt (src, title)) = do

So it's a huge bonus to have pandoc write the image directly from an Image token.
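
For a sense of why embedding is more involved than emitting markup, here is a deliberately incomplete Python sketch of two of the steps: storing the image bytes inside the zip container and registering a relationship for them. It is an illustration only and does not reproduce pandoc's code. The helper `embed_image`, the file name, and the id `rId10` are made up, and a real document additionally needs content-type entries and the drawing markup that references the relationship id.

```python
import io
import zipfile

REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
IMAGE_REL = "http://schemas.openxmlformats.org/officeDocument/2006/relationships/image"


def embed_image(archive, image_bytes, name, rel_id):
    """Store an image in the archive and return its Relationship entry."""
    archive.writestr(f"word/media/{name}", image_bytes)
    return f'<Relationship Id="{rel_id}" Type="{IMAGE_REL}" Target="media/{name}"/>'


buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    # Placeholder bytes stand in for real image data.
    rel = embed_image(archive, b"placeholder image bytes", "figure1.png", "rId10")
    archive.writestr(
        "word/_rels/document.xml.rels",
        f'<Relationships xmlns="{REL_NS}">{rel}</Relationships>',
    )
    names = archive.namelist()
```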

tarleb added a commit to tarleb/pandoc that referenced this issue Dec 9, 2020
@tarleb
Collaborator

tarleb commented Dec 9, 2020

Right, I had misunderstood what you meant.

I wanted to get a first draft done while this was still fresh in my mind, so here we go: #6941. I may rewrite some details and didn't add tests yet, but it should work as desired.

tarleb added a commit to tarleb/pandoc that referenced this issue Dec 10, 2020
tarleb added a commit to tarleb/pandoc that referenced this issue Dec 13, 2020
jgm closed this as completed in 00031fc Dec 13, 2020
@jjallaire
Author

Thanks again, this is a really terrific advance for creating sophisticated docx/pptx output!

@tarleb
Collaborator

tarleb commented Dec 13, 2020

Most welcome. This doesn't work with pptx output yet, but a similar change should be possible to enable it. I'm a bit short on time next week, but I could take another look later this month.

@jjallaire
Author

Okay, LMK if you do take a run at pptx and I will put it through its paces.
