Docx reader - support new table features #6512

emdash-ie · 2020-07-08T21:42:02Z

I haven't done many of the features yet, but I figure it's good to get something up early to show what I'm doing (and in case I'm heading in wildly the wrong direction).

Will close #6316

mb21 · 2020-07-09T07:12:04Z

src/Text/Pandoc/Readers/Docx/Parse.hs

+data VMerge = Continue
+            -- ^ This cell should be merged with the one above it
+            | Restart
+            -- ^ This cell should not be merged with the one above it


I have no clue about the docx format's way of representing table spans, but I assume you checked whether you really need these intermediate data structures and cannot convert directly to the pandoc AST table type?

I could probably do without the intermediate data structures, if that'd be preferred – I had decided that it made sense to have an intermediate representation that captures the structure of the docx format, to separate reading the XML from converting from the format into the pandoc AST types.

For example, for the VMerge type here, the xml may contain:

* no vMerge child, in which case the cell shouldn't be merged with the one above
* a vMerge child with no val attribute, in which case the cell should be merged
* a vMerge child with a val attribute of either continue or restart, meaning merged and not merged respectively

I figured it made sense to separate these XML rules from the conversion of the vMerge data into the pandoc AST type (which I haven't done yet, but I'm assuming will require looking first at the GridSpan values to group the cells into columns, followed by an iteration down each column to work out the row spans).
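The three XML cases listed above can be written as a single total function. This is a minimal sketch, not the PR's code: `parseVMerge` and its `Maybe (Maybe String)` argument are hypothetical stand-ins for the real XML lookup (outer `Maybe`: is there a vMerge child; inner `Maybe`: does it carry a val attribute). Treating unrecognised val values as continue is an assumption of this sketch.

```haskell
-- Mirrors the VMerge type from the diff above.
data VMerge
  = Continue -- ^ This cell should be merged with the one above it
  | Restart  -- ^ This cell should not be merged with the one above it
  deriving (Eq, Show)

-- Hypothetical helper: the argument models "vMerge child present?" (outer
-- Maybe) and "val attribute present?" (inner Maybe).
parseVMerge :: Maybe (Maybe String) -> VMerge
parseVMerge Nothing                 = Restart  -- no vMerge child
parseVMerge (Just Nothing)          = Continue -- vMerge child without val
parseVMerge (Just (Just "restart")) = Restart
parseVMerge (Just (Just _))         = Continue -- "continue"; any other value is
                                               -- treated the same (assumption)
```

Keeping these rules in one place means the later conversion to row spans only has to deal with `Continue` and `Restart`.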

I can see that it could be adding unnecessary code though, so I'm happy to change it. What do you think?

(I'm following the 5th edition of ECMA-376 for the format. Let me know if there's another reference I should be using)

I haven't really looked at the vMerge attributes.. but yes, if it's sufficiently different from rowspan/colspan semantics (which the pandoc AST uses), then perhaps it makes sense to use an intermediate data structure... up to you I guess :) or maybe @jkr has an opinion?

about the word format: we try to be quite backwards-compatible... and usually we rely as much on testing files that were written by Word as the spec, as the two unfortunately are not always in sync :P

Ok, I'll see how it looks as I go and keep both approaches in mind :)

Ah, that makes sense! I'll try and check that I'm keeping things backwards-compatible. Will I need to add any new tests, or should existing tests cover it?

I don't think we have any docx reader tests that cover row- or col-spans currently... so adding a few would certainly be welcome!

Excellent, I will add some.

emdash-ie · 2020-08-31T11:05:03Z

@mb21 Can I ask some advice?

For table captions inserted using Word, there doesn't seem to be an explicit link from each caption to the table it describes – a caption is just a paragraph with a table number field, and references to the table are actually references to this field. The paragraph is automatically inserted next to the table (either before or after it), but can be moved around afterwards. It's also possible to insert multiple captions against the same table, which get sequential numbers – here's a screenshot to show what I mean:

I'm wondering what the best way to link up the captions and tables is. I think the best option would be to link them up sequentially, so the first table caption in the document with the first table in the document and so on, though this would involve dropping captions which reference tables that don't exist (e.g. for the document in the screenshot we'd keep "Table 1, which has stuff in it" and drop the other two captions).

What do you think?
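The sequential pairing proposed above could be sketched as follows. `pairCaptions` is a hypothetical name, and the caption and table types are left abstract; this only illustrates the "n-th caption goes with the n-th table, extra captions are dropped" rule, not how captions are detected.

```haskell
-- Pair the n-th caption with the n-th table. Captions beyond the last table
-- are dropped; tables beyond the last caption get Nothing.
pairCaptions :: [caption] -> [table] -> [(Maybe caption, table)]
pairCaptions captions tables =
  zip (map Just captions ++ repeat Nothing) tables
```

With three captions and one table, only the first caption survives, which matches the behaviour described for the screenshot document.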

mb21 · 2020-09-01T06:45:59Z

we usually try to do whatever makes most sense in regards to most documents found "in the wild", but I'm not sure what's that in this case... do you have a few sample docs lying around, maybe from coworkers?

Another consideration is that round-tripping should work: i.e. from pandoc native to word and back, but since #6315 is still open, we cannot test this.

emdash-ie · 2020-09-01T17:40:14Z

That sounds good, I'll have a look and see what I can find in general use.

I'll keep round-tripping in mind too. I think the approach I'm leaning towards should be fine, since it isn't possible to have a table caption without a table in the pandoc native format.

lrosenthol · 2020-09-10T16:46:09Z

Looking forward to this PR @undergroundquizscene - thanks!

Two other things to keep in mind:

* making sure that internal document links to tables continue to work if you remove a caption.
* not breaking any of the auto-numbering filters.

emdash-ie · 2020-09-28T21:04:07Z

> Looking forward to this PR @undergroundquizscene - thanks!

No problem! I'll get there eventually 😅

> * making sure that internal document links to tables continue to work if you remove a caption.

Does this functionality exist in Word? From my experimentation, it seems like the internal links are actually fundamentally links to the captions, so I have a feeling if there's no caption there's no link in the source Word document. I'll test this and see.

> * not breaking any of the auto-numbering filters.

I'm not certain what you mean by "filters" here – do you just mean the auto-numbering of tables/table captions in general?

tshort · 2021-02-11T22:05:56Z

@undergroundquizscene, I tried this branch out. It's nice to be able to convert from MS-Word to HTML now and preserve colspans! I'm really looking forward to this.

One issue I ran into is that for some tables, I lost some of the contents of the cells at the bottom of the table. I can send an example if that helps.

emdash-ie · 2021-02-11T22:34:48Z

> @undergroundquizscene, I tried this branch out. It's nice to be able to convert from MS-Word to HTML now and preserve colspans! I'm really looking forward to this.
>
> One issue I ran into is that for some tables, I lost some of the contents of the cells at the bottom of the table. I can send an example if that helps.

Thanks for letting me know! I’ve been busy recently, but I’m hoping to make some more progress on this shortly. I think I had noticed what you’re describing at some point, but please do send on an example, it’ll be a big help.

tshort · 2021-02-16T01:23:45Z

src/Text/Pandoc/Readers/Docx.hs

+rowsToRows rows = do
+  let rowspans = (fmap . fmap) RowSpan (Docx.rowsToRowspans rows)
+  let rowCells = fmap (\(Docx.Row _ cs) -> cs) rows
+  cells <- zipWithM (zipWithM cellToCell) rowspans rowCells


I added a trace function here, and I think rowspans and rowCells have different lengths, so that's why cells at the end are coming up empty.

Yup, absolutely! Sorry, I didn't see this, just came back here to point that out. The problem is my rowsToRowspans function, which uses transpose to look at the columns and then transposes again to get back to rows: unfortunately that only works when every row has the same number of elements, which is not true here. I'll write a transpose that takes the column spans into account I reckon, that should sort that.
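The failure mode described above can be reproduced with plain lists. This is an illustrative sketch only: `ragged`, `roundTrip` and `truncated` are made-up names, and the real rows contain cells rather than integers.

```haskell
import Data.List (transpose)

-- Three rows with different numbers of cells, as happens when some cells
-- span several columns.
ragged :: [[Int]]
ragged = [[1, 2, 3], [4], [5, 6]]

-- Transposing twice only restores the input when every row has the same
-- length; here the short rows absorb elements from the rows below them.
roundTrip :: [[Int]]
roundTrip = transpose (transpose ragged)
-- → [[1,2,3],[4,6],[5]]

-- zip and zipWith (and likewise zipWithM) stop at the shorter list, so a
-- length mismatch silently drops the trailing elements instead of failing.
truncated :: [(Char, Int)]
truncated = zip "abc" [1, 2]
-- → [('a',1),('b',2)]
```

A span-aware transpose would need to place each cell at its grid column rather than at its list index.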

tshort · 2021-02-16T01:25:30Z

src/Text/Pandoc/Readers/Docx/Parse.hs

+    calculateColspan cell@(Cell _ vMerge _) (column, continuesBelow) =
+      case vMerge of
+        Restart -> ((cell, 1 + continuesBelow) : column, 0)
+        Continue -> (column, 1 + continuesBelow)


I'm a little confused on the naming here. When you have Restart and Continue, it seems like those are for rowspans, but this is named calculateColspan.

Yes, the name is incorrect. When I wrote it I had colspans and rowspans backwards in my head, and when I renamed everything I missed the names inside this function.
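For reference, the fold in the diff above computes row spans for one column, working from the bottom cell upwards. Below is a self-contained sketch of the same idea under simplified assumptions: `Cell` here carries only a label and a `VMerge`, and `columnRowspans` is a hypothetical name, not the function in the PR.

```haskell
data VMerge = Continue | Restart
  deriving (Eq, Show)

-- Simplified stand-in for the reader's Cell type (assumption).
data Cell = Cell String VMerge
  deriving (Eq, Show)

-- Walk one column from the bottom up. Each Continue cell is folded into the
-- nearest Restart cell above it, which receives the accumulated row span.
columnRowspans :: [Cell] -> [(Int, Cell)]
columnRowspans = fst . foldr step ([], 0)
  where
    step cell@(Cell _ vMerge) (column, continuesBelow) =
      case vMerge of
        Restart  -> ((1 + continuesBelow, cell) : column, 0)
        Continue -> (column, 1 + continuesBelow)
```

For a column of `Restart, Continue, Continue, Restart`, the first cell gets a row span of 3 and the last a row span of 1. A leading `Continue` with no `Restart` above it is dropped by this sketch.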

emdash-ie · 2021-02-19T22:02:17Z

src/Text/Pandoc/Readers/Docx.hs

-  -- pad cells.  New Text.Pandoc.Builder will do that for us,
-  -- so this is for compatibility while we switch over.
-  let cells' = map (\row -> toRow $ take width (row ++ repeat mempty)) cells
+  hdrCells <- fmap (>>= toHeaderRow) (traverse rowToRow hdr)


@tshort Also, since the header now allows multiple rows, this should be using rowsToRows instead of rowToRow, which is no longer needed.

tshort · 2021-02-27T13:43:35Z

@undergroundquizscene, after your recent update, it works great! It properly handled all the rowspans and colspans I've tried it with.

emdash-ie · 2021-03-04T22:43:36Z

src/Text/Pandoc/Readers/Docx/Parse.hs

+rowsToRowspans :: [Row] -> [[(Int, Cell)]]
+rowsToRowspans rows = let


@mb21 I want to unit test this function, but I can't figure out where to put the tests. I've tried test/Tests/Readers/Docx.hs, but I can't import the function in that file. Any idea how I can make that work?

ebahdc · 2021-05-03T21:10:38Z

A status update:

I've completed the following new table features listed in #6316:

* rowspans
* colspans
* multiple header lines
* optional short caption
* captions that allow block-level content
  * this may not currently handle fields (auto-numbered captions) correctly, I need to double-check that (and add a test for it)

I have not yet done:

* row headers
  * I'm not sure what this means, I'll check
* table head and foot
* table attributes
  * as far as I can tell, tables in the docx format don't have identifiers, but I'll check what other attributes there are and see if they're represented in the docx format

emdash-ie · 2021-05-03T21:11:58Z

(Oops, this is my other account! Was signed in with the wrong one)

tarleb · 2021-05-04T15:21:15Z

This is great! @undergroundquizscene, do you have time to zoom about this some time? I'm working on the writer and would be curious about your input.

emdash-ie · 2021-05-06T21:16:02Z

> This is great! @undergroundquizscene, do you have time to zoom about this some time? I'm working on the writer and would be curious about your input.

Sure, let me email you directly and we’ll work out a time

* Column spans
* Row spans
  - The spec says that if the `val` attribute is omitted, its value should be assumed to be `continue`, and that its values are restricted to {`restart`, `continue`}. If the attribute has any other value, I think it seems reasonable to default it to `continue`. It might cause problems if the spec is extended in the future by adding a third possible value, in which case this would probably give incorrect behaviour, and wouldn't error.
* Allow multiple header rows
* Include table description in simple caption
  - The table description element is like alt text for a table (along with the table caption element). It seems like we should include this somewhere, but I’m not 100% sure how – I’m pairing it with the simple caption for the moment. (Should it maybe go in the block caption instead?)
* Detect table captions
  - Check for caption paragraph style /and/ either the simple or complex table field. This means the caption detection fails for captions which don’t contain a field, as in an example doc I added as a test. However, I think it’s better to be too conservative: a missed table caption will still show up as a paragraph next to the table, whereas if I incorrectly classify something else as a table caption it could cause havoc by pairing it up with a table it’s not at all related to, or dropping it entirely.
* Update tests and add new ones

I noticed this wasn’t done yet, and it was a simple change. Note the column widths (which are doubles) are now displayed in the native for the tests: is that ok? I wonder about rounding errors. Would it be better to use a `Rational` here instead (especially since I’m creating them by division)?
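To make the `Double` versus `Rational` question concrete, here is a small sketch of computing relative column widths by division in both representations. `widthsDouble` and `widthsRational` are hypothetical helpers, not functions from the reader, and both assume a non-empty list of positive grid widths.

```haskell
import Data.Ratio ((%))

-- Relative widths as Doubles: each division is rounded to the nearest
-- representable value, so the results are approximations.
widthsDouble :: [Integer] -> [Double]
widthsDouble ws = [fromIntegral w / fromIntegral total | w <- ws]
  where total = sum ws

-- Relative widths as Rationals: the fractions are exact, so they always sum
-- to exactly 1 and print the same way on every platform.
widthsRational :: [Integer] -> [Rational]
widthsRational ws = [w % total | w <- ws]
  where total = sum ws
```

With `Rational`, `sum (widthsRational ws) == 1` holds exactly; with `Double` that equality is not guaranteed for every input, which is the rounding concern raised above.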

emdash-ie · 2021-05-26T20:29:24Z

Ok, I believe this is ready to be merged. I've done what I listed in my previous message, with the exception that there's only partial support for block-level content within captions: putting an ordered list in the caption in Word creates a new paragraph, for example, which is then not included in the caption. However, I think it's an improvement over the current situation, so I think it's worth merging as-is while I work on the other parts.

tarleb

Wonderful work!

jgm · 2021-05-27T15:49:04Z

We're still getting some CI failures here.

tarleb · 2021-05-28T09:50:30Z

src/Text/Pandoc/Readers/Docx.hs

+      return (Just c)
+    [] -> return Nothing
+  let shortCaption = if T.null cap then Nothing else Just (toList (text cap))
+      cap' = caption shortCaption (fromMaybe mempty fullCaption)


Would it make sense to default to plain (text cap) instead of mempty for the full caption?

Would that not duplicate the caption text (as it's going in the short caption as well)? I don't have the clearest idea of when the full caption or short caption is used.

tarleb · 2021-05-28T09:58:45Z

Any objections if I cherry-pick and amend the commits, so we can get this into the next release?

emdash-ie · 2021-05-28T16:28:38Z

> We're still getting some CI failures here.

I'll fix these now: I think it just requires updating the golden docx documents to match the test output.

emdash-ie · 2021-05-28T16:32:17Z

> Any objections if I cherry-pick and amend the commits, so we can get this into the next release?

No objections here, go for it :)

tarleb · 2021-05-28T17:29:43Z

Could you check that this is ok? https://github.com/jgm/pandoc/compare/master...tarleb:docx-reader-new-table-features?expand=1
I fixed the tests and renamed another such that it's used only by the writer tests. I also took the freedom to use the caption field for the long caption, simply because that's how this field was used before AFAIK. The commit message is altered slightly as to be more in line with the default style, but I left the explanation in one of them.
The idea about using Rational for column widths should probably be raised here: jgm/pandoc-types#86

emdash-ie · 2021-05-28T17:55:39Z

> Could you check that this is ok? https://github.com/jgm/pandoc/compare/master...tarleb:docx-reader-new-table-features?expand=1
> I fixed the tests and renamed another such that it's used only by the writer tests. I also took the freedom to use the caption field for the long caption, simply because that's how this field was used before AFAIK. The commit message is altered slightly as to be more in line with the default style, but I left the explanation in one of them.

Sure, I'll take a look now.

emdash-ie · 2021-05-28T18:03:11Z

> Could you check that this is ok? https://github.com/jgm/pandoc/compare/master...tarleb:docx-reader-new-table-features?expand=1
> I fixed the tests and renamed another such that it's used only by the writer tests. I also took the freedom to use the caption field for the long caption, simply because that's how this field was used before AFAIK. The commit message is altered slightly as to be more in line with the default style, but I left the explanation in one of them.
> The idea about using Rational for column widths should probably be raised here: jgm/pandoc-types#86

Looks good! I actually can't see what you're saying about the caption field though: mind pointing me to the specific bit?

tarleb · 2021-05-28T18:05:19Z

Sorry, I forgot to push that change. It's here now: https://github.com/jgm/pandoc/compare/master...tarleb:docx-reader-new-table-features?expand=1#diff-22df782fded6b5f41521820cbf8de809ecd9835ce95f74fa2ca4eb49e0b0831dR675

emdash-ie · 2021-05-28T18:08:49Z

> Sorry, I forgot to push that change. It's here now: https://github.com/jgm/pandoc/compare/master...tarleb:docx-reader-new-table-features?expand=1#diff-22df782fded6b5f41521820cbf8de809ecd9835ce95f74fa2ca4eb49e0b0831dR675

Looks good, thanks :)

emdash-ie · 2021-05-28T18:15:58Z

@tarleb Will I close this PR then? I plan to work on more of the docx table features over the next while, but it's probably easier to put them in a new PR rather than this one.

tarleb · 2021-05-28T18:18:12Z

I think so, too. For reference: the commits have been pushed as 44484d0 and 56b2111.

Thanks again!

emdash-ie · 2021-05-28T18:19:01Z

I saw it automatically closed #6316: we'll probably want to re-open that as some of the table features are not yet supported.

Emily Bourke and others added 2 commits May 26, 2021 18:57

emdash-ie marked this pull request as ready for review May 26, 2021 20:16

tarleb approved these changes May 27, 2021

emdash-ie changed the title ~~WIP: Docx reader - support new table features~~ Docx reader - support new table features May 28, 2021

emdash-ie closed this May 28, 2021

emdash-ie deleted the 6316-docx-reader-new-table-features branch May 28, 2021 18:19

emdash-ie mentioned this pull request May 28, 2021

Docx reader - support new table features #6316

Open
