Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Docx reader - support new table features #6512

Closed
wants to merge 2 commits into from
Closed

Docx reader - support new table features #6512

wants to merge 2 commits into from

Conversation

emdash-ie
Copy link
Contributor

I haven't done many of the features yet, but I figure it's good to get something up early to show what I'm doing (and in case I'm heading in wildly the wrong direction).

Will close #6316

data VMerge = Continue
-- ^ This cell should be merged with the one above it
| Restart
-- ^ This cell should not be merged with the one above it
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no clue about the docx format's way or representing table spans, but I assume you checked whether you really need these intermediate data structures and cannot convert directly to the pandoc AST table type?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could probably do without the intermediate data structures, if that'd be preferred – I had decided that it made sense to have an intermediate representation that captures the structure of the docx format, to separate reading the XML from converting from the format into the pandoc AST types.

For example, for the VMerge type here, the xml may contain:

  • no vMerge child, in which case the cell shouldn't be merged with the one above
  • a vMerge child with no val attribute, in which case the cell should be merged
  • a vMerge child with a val attribute of either continue or restart, meaning merged and not merged respectively

I figured it made sense to separate these XML rules from the conversion of the vMerge data into the pandoc AST type (which I haven't done yet, but I'm assuming will require looking first at the GridSpan values to group the cells into columns, followed by an iteration down each column to work out the row spans).

I can see that it could be adding unnecessary code though, so I'm happy to change it. What do you think?

(I'm following the 5th edition of ECMA-376 for the format. Let me know if there's another reference I should be using)

Copy link
Collaborator

@mb21 mb21 Jul 18, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't really looked at the vMerge attributes.. but yes, if it's sufficiently different from rowspan/colspan semantics (which the pandoc AST uses), then perhaps it makes sense to use an intermediate data structure... up to you I guess :) or maybe @jkr has an opinion?

about the word format: we try to be quite backwards-compatible... and usually we rely as much on testing files that were written by Word as the spec, as the two unfortunately are not always in sync :P

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I'll see how it looks as I go and keep both approaches in mind :)

Ah, that makes sense! I'll try and check that I'm keeping things backwards-compatible. Will I need to add any new tests, or should existing tests cover it?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we have any docx reader tests that cover row- or col-spans currently... so adding a few would certainly be welcome!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent, I will add some.

@emdash-ie
Copy link
Contributor Author

@mb21 Can I ask some advice?

For table captions inserted using Word, there doesn't seem to be an explicit link from each caption to the table it describes – a caption is just a paragraph with a table number field, and references to the table are actually references to this field. The paragraph is automatically inserted next to the table (either before or after it), but can be moved around afterwards. It's also possible to insert multiple captions against the same table, which get sequential numbers – here's a screenshot to show what I mean:

Screenshot 2020-08-31 at 11 59 32

I'm wondering what the best way to link up the captions and tables is. I think the best option would be to link them up sequentially, so the first table caption in the document with the first table in the document and so on, though this would involve dropping captions which reference tables that don't exist (e.g. for the document in the screenshot we'd keep "Table 1, which has stuff in it" and drop the other two captions).

What do you think?

@mb21
Copy link
Collaborator

mb21 commented Sep 1, 2020

we usually try to do whatever makes most sense in regards to most documents found "in the wild", but I'm not sure what's that in this case... do you have a few sample docs lying around, maybe from coworkers?

Another consideration is that round-tripping should work: i.e. from pandoc native to word and back, but since #6315 is still open, we cannot test this.

@emdash-ie
Copy link
Contributor Author

That sounds good, I'll have a look and see what I can find in general use.

I'll keep round-tripping in mind too. I think the approach I'm leaning towards should be fine, since it isn't possible to have a table caption without a table in the pandoc native format.

@lrosenthol
Copy link
Contributor

Looking forward to this PR @undergroundquizscene - thanks!

Two other things to keep in mind:

  • making sure that internal document links to tables continue to work if you remove a caption.
  • not breaking any of the auto-numbering filters.

@emdash-ie
Copy link
Contributor Author

Looking forward to this PR @undergroundquizscene - thanks!

No problem! I'll get there eventually 😅

* making sure that internal document links to tables continue to work if you remove a caption.

Does this functionality exist in Word? From my experimentation, it seems like the internal links are actually fundamentally links to the captions, so I have a feeling if there's no caption there's no link in the source Word document. I'll test this and see.

* not breaking any of the auto-numbering filters.

I'm not certain what you mean by "filters" here – do you just mean the auto-numbering of tables/table captions in general?

@tshort
Copy link

tshort commented Feb 11, 2021

@undergroundquizscene, I tried this branch out. It's nice to be able to convert from MS-Word to HTML now and preserve colspans! I'm really looking forward to this.

One issue I ran into is that for some tables, I lost some of contents of the cells on the bottom of the table. I can send an example if that helps.

@emdash-ie
Copy link
Contributor Author

@undergroundquizscene, I tried this branch out. It's nice to be able to convert from MS-Word to HTML now and preserve colspans! I'm really looking forward to this.

One issue I ran into is that for some tables, I lost some of contents of the cells on the bottom of the table. I can send an example if that helps.

Thanks for letting me know! I’ve been busy recently, but I’m hoping to make some more progress on this shortly. I think I had noticed what you’re describing at some point, but please do send on an example, it’ll be a big help.

rowsToRows rows = do
let rowspans = (fmap . fmap) RowSpan (Docx.rowsToRowspans rows)
let rowCells = fmap (\(Docx.Row _ cs) -> cs) rows
cells <- zipWithM (zipWithM cellToCell) rowspans rowCells
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a trace function here, and I think rowspans and rowCells have different lengths, so that's why cells at the end are coming up empty.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, absolutely! Sorry, I didn't see this, just came back here to point that out. The problem is my rowsToRowspans function, which uses transpose to look at the columns and then transposes again to get back to rows: unfortunately that only works when every row has the same number of elements, which is not true here. I'll write a transpose that takes the column spans into account I reckon, that should sort that.

calculateColspan cell@(Cell _ vMerge _) (column, continuesBelow) =
case vMerge of
Restart -> ((cell, 1 + continuesBelow) : column, 0)
Continue -> (column, 1 + continuesBelow)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a little confused on the naming here. When you have Restart and Continue, it seems like those are for rowspans, but this is named calculateColspan.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the name is incorrect. When I wrote it I had colspans and rowspans backwards in my head, and when I renamed everything I missed the names inside this function.

-- pad cells. New Text.Pandoc.Builder will do that for us,
-- so this is for compatibility while we switch over.
let cells' = map (\row -> toRow $ take width (row ++ repeat mempty)) cells
hdrCells <- fmap (>>= toHeaderRow) (traverse rowToRow hdr)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tshort Also, since the header now allows multiple rows, this should be using rowsToRows instead of rowToRow, which is no longer needed.

@tshort
Copy link

tshort commented Feb 27, 2021

@undergroundquizscene, after your recent update, it works great! It properly handled all the rowspans and colspans I've tried it with.

Comment on lines +279 to +259
rowsToRowspans :: [Row] -> [[(Int, Cell)]]
rowsToRowspans rows = let
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mb21 I want to unit test this function, but I can't figure out where to put the tests. I've tried test/Tests/Readers/Docx.hs, but I can't import the function in that file. Any idea how I can make that work?

@ebahdc
Copy link

ebahdc commented May 3, 2021

A status update:

I've completed the following new table features listed in #6316:

  • rowspans
  • colspans
  • multiple header lines
  • optional short caption
  • captions that allow block-level content
    • this may not currently handle fields (auto-numbered captions) correctly, I need to double-check that (and add a test for it)

I have not yet done:

  • row headers
    • I'm not sure what this means, I'll check
  • table head and foot
  • table attributes
    • as far as I can tell, tables in the docx format don't have identifiers, but I'll check what other attributes there are and see if they're represented in the docx format

@emdash-ie
Copy link
Contributor Author

A status update:

I've completed the following new table features listed in #6316:

* rowspans

* colspans

* multiple header lines

* optional short caption

* captions that allow block-level content
  
  * this may not currently handle fields (auto-numbered captions) correctly, I need to double-check that (and add a test for it)

I have not yet done:

* row headers
  
  * I'm not sure what this means, I'll check

* table head and foot

* table attributes
  
  * as far as I can tell, tables in the docx format don't have identifiers, but I'll check what other attributes there are and see if they're represented in the docx format

(Oops, this is my other account! Was signed in with the wrong one)

@tarleb
Copy link
Collaborator

tarleb commented May 4, 2021

This is great! @undergroundquizscene, do you have time to zoom about this some time? I'm working on the writer and would be curious about your input.

@emdash-ie
Copy link
Contributor Author

This is great! @undergroundquizscene, do you have time to zoom about this some time? I'm working on the writer and would be curious about your input.

Sure, let me email you directly and we’ll work out a time

Emily Bourke and others added 2 commits May 26, 2021 18:57
* Column spans
* Row spans
  - The spec says that if the `val` attribute is ommitted, its value
    should be assumed to be `continue`, and that its values are
    restricted to {`restart`, `continue`}. If the value has any other
    value, I think it seems reasonable to default it to `continue`. It
    might cause problems if the spec is extended in the future by adding
    a third possible value, in which case this would probably give
    incorrect behaviour, and wouldn't error.
* Allow multiple header rows
* Include table description in simple caption
  - The table description element is like alt text for a table (along
    with the table caption element). It seems like we should include
    this somewhere, but I’m not 100% sure how – I’m pairing it with the
    simple caption for the moment. (Should it maybe go in the block
    caption instead?)
* Detect table captions
  - Check for caption paragraph style /and/ either the simple or
    complex table field. This means the caption detection fails for
    captions which don’t contain a field, as in an example doc I added
    as a test. However, I think it’s better to be too conservative: a
    missed table caption will still show up as a paragraph next to the
    table, whereas if I incorrectly classify something else as a table
    caption it could cause havoc by pairing it up with a table it’s
    not at all related to, or dropping it entirely.
* Update tests and add new ones
I noticed this wasn’t done yet, and it was a simple change.

Note the column widths (which are doubles) are now displayed in the
native for the tests: is that ok? I wonder about rounding errors. Would
it be better to use a `Rational` here instead (especially since I’m
creating them by division)?
@emdash-ie emdash-ie marked this pull request as ready for review May 26, 2021 20:16
@emdash-ie
Copy link
Contributor Author

Ok, I believe this is ready to be merged. I've done what I listed in my previous message, with the exception that there's only partial support for block-level content within captions: putting an ordered list in the caption in Word creates a new paragraph, for example, which is then not included in the caption. However, I think it's an improvement over the current situation, so I think it's worth merging as-is while I work on the other parts.

Copy link
Collaborator

@tarleb tarleb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wonderful work!

@jgm
Copy link
Owner

jgm commented May 27, 2021

We're still getting some CI failures here.

return (Just c)
[] -> return Nothing
let shortCaption = if T.null cap then Nothing else Just (toList (text cap))
cap' = caption shortCaption (fromMaybe mempty fullCaption)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to default to plain (text cap) instead of mempty for the full caption?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would that not duplicate the caption text (as it's going in the short caption as well)? I don't have the clearest idea of when the full caption or short caption is used.

@tarleb
Copy link
Collaborator

tarleb commented May 28, 2021

Any objections if I cherry-pick and amend the commits, so we can get this into the next release?

@emdash-ie
Copy link
Contributor Author

We're still getting some CI failures here.

I'll fix these now: I think it just requires updating the golden docx documents to match the test output.

@emdash-ie
Copy link
Contributor Author

Any objections if I cherry-pick and amend the commits, so we can get this into the next release?

No objections here, go for it :)

@tarleb
Copy link
Collaborator

tarleb commented May 28, 2021

Could you check that this is ok? https://github.com/jgm/pandoc/compare/master...tarleb:docx-reader-new-table-features?expand=1
I fixed the tests and renamed another such that it's used only by the writer tests. I also took the freedom to use the caption field for the long caption, simply because that's how this field was used before AFAIK. The commit message is altered slightly as to be more in line with the default style, but I left the explanation in one of them.
The idea about using Rational for column widths should probably be raised here: jgm/pandoc-types#86

@emdash-ie
Copy link
Contributor Author

Could you check that this is ok? https://github.com/jgm/pandoc/compare/master...tarleb:docx-reader-new-table-features?expand=1
I fixed the tests and renamed another such that it's used only by the writer tests. I also took the freedom to use the caption field for the long caption, simply because that's how this field was used before AFAIK. The commit message is altered slightly as to be more in line with the default style, but I left the explanation in one of them.

Sure, I'll take a look now.

@emdash-ie
Copy link
Contributor Author

Could you check that this is ok? https://github.com/jgm/pandoc/compare/master...tarleb:docx-reader-new-table-features?expand=1
I fixed the tests and renamed another such that it's used only by the writer tests. I also took the freedom to use the caption field for the long caption, simply because that's how this field was used before AFAIK. The commit message is altered slightly as to be more in line with the default style, but I left the explanation in one of them.
The idea about using Rational for column widths should probably be raised here: jgm/pandoc-types#86

Looks good! I actually can't see what you're saying about the caption field though: mind pointing me to the specific bit?

@tarleb
Copy link
Collaborator

tarleb commented May 28, 2021

Sorry, I forgot to push that change. It's here now: https://github.com/jgm/pandoc/compare/master...tarleb:docx-reader-new-table-features?expand=1#diff-22df782fded6b5f41521820cbf8de809ecd9835ce95f74fa2ca4eb49e0b0831dR675

@emdash-ie
Copy link
Contributor Author

@emdash-ie emdash-ie changed the title WIP: Docx reader - support new table features Docx reader - support new table features May 28, 2021
@emdash-ie
Copy link
Contributor Author

@tarleb Will I close this PR then? I plan to work on more of the docx table features over the next while, but it's probably easier to put them in a new PR rather than this one.

@tarleb
Copy link
Collaborator

tarleb commented May 28, 2021

I think so, too. For reference: the commits have been pushed as 44484d0 and 56b2111.

Thanks again!

@emdash-ie
Copy link
Contributor Author

I saw it automatically closed #6316: we'll probably want to re-open that as some of the table features are not yet supported.

@emdash-ie emdash-ie closed this May 28, 2021
@emdash-ie emdash-ie deleted the 6316-docx-reader-new-table-features branch May 28, 2021 18:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Docx reader - support new table features
7 participants