openxml raw inline/block without parsing? #6933

jjallaire · 2020-12-07T21:31:12Z

I've noticed that when creating a RawInline or RawBlock of type "openxml", the XML is actually parsed into a valid XML fragment (closing tags as necessary). It looks like that is happening here:

pandoc/src/Text/Pandoc/Writers/Docx.hs

Lines 974 to 978 in 5bbd5a9

    
    blockToOpenXML' _ b@(RawBlock format str)
      | format == Format "openxml" = return [ x | Elem x <- parseXML str ]
      | otherwise                  = do
          report $ BlockNotRendered b
          return []

I'm wondering if this is a hard requirement or if an option to avoid this could be provided? The use case is constructing more elaborate structures (e.g. tables) where we embed standard markdown tokens inside raw structures. This is done, for example, in the pandoc-crossref filter to provide LaTeX figure/subfigure layout (where the raw tokens provide the figure/subfigure LaTeX structure and the markdown tokens are then used to actually render the figures).

I was hoping to create a similar feature that enabled grid based figure layouts for docx, but am stuck on this constraint as e.g this token:

pandoc.RawInline("openxml", "<w:tbl>")

Ends up in the document like this:

<w:tbl></w:tbl>

The issue w/ just using a pandoc.Table is that it doesn't appear as if you can get precise percentage-based layout for markdown tables emitted to docx, nor can you (currently) set cell based alignment. So I don't think you can do something like this: http://lierdakil.github.io/pandoc-crossref/#subfigure-grid

I may be missing something here but just wanted to record the constraints I'm seeing and hoping there is a way around them either w/ an existing behavior or perhaps a new one.

jgm · 2020-12-08T00:03:00Z

The problem is that the writer is building up a data structure that represents an XML document tree. We can't just insert a string into this -- only a node of an XML document -- an element, or some plain text.
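As a rough illustration of this constraint (an analogy in Python's standard library, not pandoc's Haskell code), the sketch below builds a tree of element nodes and tries to turn raw strings into nodes. The helper name `parse_fragment` and the wrapper `root` element are made up for the example. Note one difference: the strict parser used here rejects a dangling open tag, whereas the behaviour reported above was that the tag got closed automatically.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace, needed so the w: prefix resolves.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def parse_fragment(fragment):
    """Parse a raw string into a list of element nodes, or None on failure.

    Hypothetical helper: the fragment is wrapped in a throwaway root that
    declares the w: prefix, because a tree can only hold complete nodes.
    """
    wrapped = f'<root xmlns:w="{W_NS}">{fragment}</root>'
    try:
        root = ET.fromstring(wrapped)
    except ET.ParseError:
        return None
    return list(root)


body = ET.Element("body")

# A complete element becomes a node and can be attached to the tree.
complete = parse_fragment("<w:p><w:r><w:t>hi</w:t></w:r></w:p>")
body.extend(complete)

# A lone opening tag is not a node, so there is nothing to attach.
dangling = parse_fragment("<w:tbl>")
```

Whatever the parser does with the dangling tag (reject it or close it), the outcome is the same for a filter author: an opening tag cannot be left open for later content to land inside.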

it doesn't appear as if you can get precise percentage-based layout for markdown tables emitted to docx

To the contrary, the docx writer does take into account width information on table cells, emitting a w:w attribute (width) inside w:gridCol. It's possible that there's some issue with this, but if so we should address it.

jjallaire · 2020-12-08T01:37:19Z

Okay, I thought it might be something like that so no real way around that constraint.

I'll do some additional experimentation with the table columns. One issue may be that, while the columns are sized correctly, if <w:tblLayout w:type="fixed"/> isn't used then Word is free to re-adjust them based on their content. The trick may be to size both the columns and their contents. I'll dig further into using markdown tables and report back on what I find.

jjallaire · 2020-12-08T01:43:57Z

Just recalled that another issue I was trying to solve for in emitting raw openxml was a table which has distinct column sizing per-row (for more sophisticated figure layouts). With the new table AST this will be possible but until that's supported in the docx writer the best solution might be to use native pandoc column sizing and create a separate table per row of figures. Again, will report back after further experimentation.

jjallaire · 2020-12-08T13:02:37Z

It turns out that using a combination of column widths and underlying figure widths you can indeed do arbitrary horizontal layout w/ in-cell alignment. The key is to distribute the column widths evenly so that they add up to 1.0 (so that the full horizontal width of the page is occupied) and then to explicitly size the figures using physical units (inches). Word ends up resizing the columns, but it does so w/r/t the content widths, so it comes out the way you want.
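The arithmetic behind that approach can be sketched as follows. This is only an illustration: the function name `panel_row` and the 6.5 inch text width (US Letter with 1 inch margins) are assumptions, not values taken from the thread.

```python
def panel_row(figure_fractions, text_width_in=6.5):
    """Compute widths for one row of a figure panel.

    figure_fractions: share of the text width each figure should occupy.
    text_width_in: assumed usable page width in inches (hypothetical default).

    Returns (column widths that sum to 1.0, figure widths as inch strings).
    """
    n = len(figure_fractions)
    # Columns are spread evenly so the table spans the full text width.
    col_widths = [1.0 / n] * n
    # Figures get explicit physical sizes; Word then fits columns to content.
    fig_widths = [f"{frac * text_width_in:.2f}in" for frac in figure_fractions]
    return col_widths, fig_widths


cols, figs = panel_row([0.5, 0.25], text_width_in=6.0)
```

The column widths would go on the pandoc table, and each inch string would be used as the `width` attribute of the corresponding image.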

Here's an example of what a so composed figure panel looks like rendered in docx:

A couple of related requests (LMK if I should open separate issues for these):

1. When I add multiple tables to the AST we end up with an extra (empty) <w:p> tag between the tables (you can see this extra vertical space in the image above). Is there another thing I could insert explicitly that would close the vertical gap?
2. The caption for the entire panel is created w/ the following openxml:

<w:p>
  <w:pPr>
    <w:jc w:val="center"/>
    <w:pStyle w:val="ImageCaption"/>
  </w:pPr>
  <w:r>
    <w:t xml:space="preserve">Figure 1: Full Caption</w:t>
  </w:r>
</w:p>

Unfortunately the actual caption text had to be inserted using pandoc.utils.stringify (to create a fully valid XML node, as per the above discussion), so we lose any markdown formatting it had. I understand that we can't compose XML in fragments, but perhaps we could go the other way and call a Lua function to render the caption into an arbitrary raw format, e.g. pandoc.utils.render("openxml", captionEl).

It's not a huge problem to lose the markdown in the caption as it's probably somewhat rare (although for some disciplines it seems like math would be a frequent requirement).
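For illustration, here is a minimal Python sketch of the stringify-then-wrap approach described above. The function name `caption_openxml` is hypothetical, and the element order simply mirrors the XML shown earlier in this comment. The one step worth showing is escaping: once the caption is plain text, characters such as & and < must be escaped before being placed inside w:t.

```python
from xml.sax.saxutils import escape


def caption_openxml(text):
    """Wrap already-stringified caption text in a centered caption paragraph.

    Hypothetical helper: any markdown formatting has been lost by the time
    the text arrives here, which is the limitation discussed above.
    """
    return (
        "<w:p>"
        "<w:pPr>"
        '<w:jc w:val="center"/>'
        '<w:pStyle w:val="ImageCaption"/>'
        "</w:pPr>"
        "<w:r>"
        # escape() handles &, < and > so the node stays well-formed.
        f'<w:t xml:space="preserve">{escape(text)}</w:t>'
        "</w:r>"
        "</w:p>"
    )


xml = caption_openxml("Figure 1: A & B <draft>")
```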

jjallaire · 2020-12-08T14:04:22Z

For (1) above, inserting an empty "openxml" RawBlock between the tables seems to be enough to prevent the automatic insertion of an empty paragraph.

jgm · 2020-12-08T19:56:21Z

When I add multiple tables to the AST we end up with an extra (empty) <w:p> tag between the tables (you can see this extra vertical space in the image above).

See #4315 and commit 93e3d46 for the motivation for this. Your workaround seems good.

I don't currently see a good way around the other issue. In principle we could add Lua support for rendering -- I think there may be an issue for this already -- but we don't even expose an openxml writer currently.

jjallaire · 2020-12-08T19:58:54Z

My initial workaround ended up with the combined tables noted in #4315. Here is where I landed (a zero-height text frame):

<w:p>
  <w:pPr>
    <w:framePr w:w="0" w:h="0" w:vAnchor="margin" w:hAnchor="margin" w:xAlign="right" w:yAlign="top"/>
  </w:pPr>
</w:p>

tarleb · 2020-12-09T10:47:46Z

I had a look at the Docx writer and believe that the initial request, i.e., adding raw openxml blocks without parsing, would be feasible with moderate effort.

The writer currently uses lists of Elements as building blocks. I believe that it wouldn't be too hard to generalize this and use Content instead. This would then allow packing the raw blocks into raw CData elements.

Is this worthwhile, and should the issue be reopened?

jjallaire · 2020-12-09T11:50:50Z

In my view it's extremely powerful to be able to intermix raw markup with pandoc tokens. This is used to great effect in pandoc-crossref figure layout, where arbitrary raw tex composes a structure (a subfigure grid) but then allows Pandoc to render the actual figures. The alternative if this weren't possible would be to emit the figures using additional raw markup, but then they are essentially lost from the AST for downstream processing by other filters (and we lose whatever other desirable native behaviors pandoc has). This also becomes relevant for captions, as you really want to allow markup in captions (again, emitting the caption entirely using raw markup requires pandoc.utils.stringify).

In LaTeX or HTML it's straightforward enough to emit raw markup for figures, so if you are willing to accept the tradeoffs of erasing the figure from the AST and not supporting markup in the caption, you can at least do it. For docx though, emitting figures is more complex (they need to be properly embedded in the zip file), so we really need pandoc to do this processing from the AST.

You can imagine other scenarios where emitting raw openxml would be desirable: for example, in a PowerPoint writer you might want to emit multiple "frames" of content on a slide. If we could emit partial XML structures then this would be possible for a filter (it could emit the begin and end frame xml literally, and let pandoc fill in the middle with standard markup processing).
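To make the pattern concrete, here is a small Python sketch that operates on pandoc's JSON AST, where a raw block is serialized as {"t": "RawBlock", "c": [format, text]}. The helper names `raw_block` and `wrap_blocks` are invented for the example, and the opening and closing strings are deliberately minimal (a real table would also need properties and a grid). It only produces useful output if the writer keeps raw openxml strings verbatim, which is what this issue asks for.

```python
import json


def raw_block(xml, fmt="openxml"):
    """Build a RawBlock node in pandoc's JSON AST shape."""
    return {"t": "RawBlock", "c": [fmt, xml]}


def wrap_blocks(blocks, open_xml, close_xml):
    """Surround ordinary blocks with literal opening and closing markup.

    The blocks in the middle stay in the AST, so pandoc (and any later
    filter) still renders and sees them as normal content.
    """
    return [raw_block(open_xml)] + list(blocks) + [raw_block(close_xml)]


# Placeholder content; in practice this would be a paragraph holding an Image.
figure_para = {"t": "Para", "c": [{"t": "Str", "c": "figure-goes-here"}]}

wrapped = wrap_blocks(
    [figure_para],
    "<w:tbl><w:tr><w:tc>",
    "</w:tc></w:tr></w:tbl>",
)
print(json.dumps(wrapped, indent=2))
```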

tarleb · 2020-12-09T13:02:02Z

Reopened. I may be able to give it a try later this week.

For docx though, emitting figures is more complex (they need to be properly embedded in the zip file) so we really need pandoc to do this processing from the AST.

Lua filters have access to pandoc's "mediabag", so I believe it might be (well, become) possible to handle that in a filter.

jjallaire · 2020-12-09T13:36:10Z

If you do give it a try then LMK and I'll test immediately with our use case.

Lua filters have access to pandoc's "mediabag", so I believe it might be (well, become) possible to handle that in a filter.

That's true, but the Pandoc code required to emit docx images is from what I can see quite a bit more involved than for LaTeX or HTML:

pandoc/src/Text/Pandoc/Writers/Docx.hs

Line 1374 in 5bbd5a9

inlineToOpenXML' opts (Image attr@(imgident, _, _) alt (src, title)) = do

So it's a huge bonus to have pandoc write the image directly from an Image token.

tarleb · 2020-12-09T22:31:59Z

Right, I had misunderstood what you meant.

I wanted to get a first draft done while this was still fresh in my mind, so here we go: #6941. I may rewrite some details and didn't add tests yet, but it should work as desired.

jjallaire · 2020-12-13T19:12:26Z

Thanks again, this is a really terrific advance for creating sophisticated docx/pptx output!

tarleb · 2020-12-13T19:52:32Z

Most welcome. This doesn't work with pptx output yet, but a similar change should be possible to enable it. I'm a bit short on time next week, but I could take another look later this month.

jjallaire · 2020-12-13T20:24:47Z

Okay, LMK if you do take a run at pptx and I will put it through its paces.

jjallaire closed this as completed Dec 9, 2020

tarleb reopened this Dec 9, 2020

tarleb added a commit to tarleb/pandoc that referenced this issue Dec 9, 2020

Docx reader: keep rwa openxml strings verbatim

6ddc921

Closes: jgm#6933

tarleb added a commit to tarleb/pandoc that referenced this issue Dec 10, 2020

Docx reader: keep rwa openxml strings verbatim

20bddae

Closes: jgm#6933

tarleb added a commit to tarleb/pandoc that referenced this issue Dec 13, 2020

Docx writer: keep raw openxml strings verbatim.

a86cd0d

Closes: jgm#6933

jgm closed this as completed in 00031fc Dec 13, 2020
