-
-
Notifications
You must be signed in to change notification settings - Fork 3.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Docx reader - support new table features #6512
Docx reader - support new table features #6512
Conversation
data VMerge = Continue | ||
-- ^ This cell should be merged with the one above it | ||
| Restart | ||
-- ^ This cell should not be merged with the one above it |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have no clue about the docx format's way or representing table spans, but I assume you checked whether you really need these intermediate data structures and cannot convert directly to the pandoc AST table type?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could probably do without the intermediate data structures, if that'd be preferred – I had decided that it made sense to have an intermediate representation that captures the structure of the docx format, to separate reading the XML from converting from the format into the pandoc AST types.
For example, for the VMerge
type here, the xml may contain:
- no
vMerge
child, in which case the cell shouldn't be merged with the one above - a
vMerge
child with noval
attribute, in which case the cell should be merged - a
vMerge
child with aval
attribute of eithercontinue
orrestart
, meaning merged and not merged respectively
I figured it made sense to separate these XML rules from the conversion of the vMerge data into the pandoc AST type (which I haven't done yet, but I'm assuming will require looking first at the GridSpan
values to group the cells into columns, followed by an iteration down each column to work out the row spans).
I can see that it could be adding unnecessary code though, so I'm happy to change it. What do you think?
(I'm following the 5th edition of ECMA-376 for the format. Let me know if there's another reference I should be using)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't really looked at the vMerge attributes.. but yes, if it's sufficiently different from rowspan/colspan semantics (which the pandoc AST uses), then perhaps it makes sense to use an intermediate data structure... up to you I guess :) or maybe @jkr has an opinion?
about the word format: we try to be quite backwards-compatible... and usually we rely as much on testing files that were written by Word as the spec, as the two unfortunately are not always in sync :P
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I'll see how it looks as I go and keep both approaches in mind :)
Ah, that makes sense! I'll try and check that I'm keeping things backwards-compatible. Will I need to add any new tests, or should existing tests cover it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we have any docx reader tests that cover row- or col-spans currently... so adding a few would certainly be welcome!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Excellent, I will add some.
@mb21 Can I ask some advice? For table captions inserted using Word, there doesn't seem to be an explicit link from each caption to the table it describes – a caption is just a paragraph with a table number field, and references to the table are actually references to this field. The paragraph is automatically inserted next to the table (either before or after it), but can be moved around afterwards. It's also possible to insert multiple captions against the same table, which get sequential numbers – here's a screenshot to show what I mean: I'm wondering what the best way to link up the captions and tables is. I think the best option would be to link them up sequentially, so the first table caption in the document with the first table in the document and so on, though this would involve dropping captions which reference tables that don't exist (e.g. for the document in the screenshot we'd keep "Table 1, which has stuff in it" and drop the other two captions). What do you think? |
we usually try to do whatever makes most sense in regards to most documents found "in the wild", but I'm not sure what's that in this case... do you have a few sample docs lying around, maybe from coworkers? Another consideration is that round-tripping should work: i.e. from pandoc native to word and back, but since #6315 is still open, we cannot test this. |
That sounds good, I'll have a look and see what I can find in general use. I'll keep round-tripping in mind too. I think the approach I'm leaning towards should be fine, since it isn't possible to have a table caption without a table in the pandoc native format. |
Looking forward to this PR @undergroundquizscene - thanks! Two other things to keep in mind:
|
No problem! I'll get there eventually 😅
Does this functionality exist in Word? From my experimentation, it seems like the internal links are actually fundamentally links to the captions, so I have a feeling if there's no caption there's no link in the source Word document. I'll test this and see.
I'm not certain what you mean by "filters" here – do you just mean the auto-numbering of tables/table captions in general? |
@undergroundquizscene, I tried this branch out. It's nice to be able to convert from MS-Word to HTML now and preserve colspans! I'm really looking forward to this. One issue I ran into is that for some tables, I lost some of contents of the cells on the bottom of the table. I can send an example if that helps. |
Thanks for letting me know! I’ve been busy recently, but I’m hoping to make some more progress on this shortly. I think I had noticed what you’re describing at some point, but please do send on an example, it’ll be a big help. |
src/Text/Pandoc/Readers/Docx.hs
Outdated
rowsToRows rows = do | ||
let rowspans = (fmap . fmap) RowSpan (Docx.rowsToRowspans rows) | ||
let rowCells = fmap (\(Docx.Row _ cs) -> cs) rows | ||
cells <- zipWithM (zipWithM cellToCell) rowspans rowCells |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a trace
function here, and I think rowspans
and rowCells
have different lengths, so that's why cells at the end are coming up empty.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, absolutely! Sorry, I didn't see this, just came back here to point that out. The problem is my rowsToRowspans
function, which uses transpose
to look at the columns and then transposes
again to get back to rows: unfortunately that only works when every row has the same number of elements, which is not true here. I'll write a transpose
that takes the column spans into account I reckon, that should sort that.
calculateColspan cell@(Cell _ vMerge _) (column, continuesBelow) = | ||
case vMerge of | ||
Restart -> ((cell, 1 + continuesBelow) : column, 0) | ||
Continue -> (column, 1 + continuesBelow) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a little confused on the naming here. When you have Restart
and Continue
, it seems like those are for rowspan
s, but this is named calculateColspan
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the name is incorrect. When I wrote it I had colspans and rowspans backwards in my head, and when I renamed everything I missed the names inside this function.
src/Text/Pandoc/Readers/Docx.hs
Outdated
-- pad cells. New Text.Pandoc.Builder will do that for us, | ||
-- so this is for compatibility while we switch over. | ||
let cells' = map (\row -> toRow $ take width (row ++ repeat mempty)) cells | ||
hdrCells <- fmap (>>= toHeaderRow) (traverse rowToRow hdr) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tshort Also, since the header now allows multiple rows, this should be using rowsToRows
instead of rowToRow
, which is no longer needed.
@undergroundquizscene, after your recent update, it works great! It properly handled all the rowspans and colspans I've tried it with. |
rowsToRowspans :: [Row] -> [[(Int, Cell)]] | ||
rowsToRowspans rows = let |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mb21 I want to unit test this function, but I can't figure out where to put the tests. I've tried test/Tests/Readers/Docx.hs
, but I can't import the function in that file. Any idea how I can make that work?
A status update: I've completed the following new table features listed in #6316:
I have not yet done:
|
(Oops, this is my other account! Was signed in with the wrong one) |
This is great! @undergroundquizscene, do you have time to zoom about this some time? I'm working on the writer and would be curious about your input. |
Sure, let me email you directly and we’ll work out a time |
* Column spans * Row spans - The spec says that if the `val` attribute is ommitted, its value should be assumed to be `continue`, and that its values are restricted to {`restart`, `continue`}. If the value has any other value, I think it seems reasonable to default it to `continue`. It might cause problems if the spec is extended in the future by adding a third possible value, in which case this would probably give incorrect behaviour, and wouldn't error. * Allow multiple header rows * Include table description in simple caption - The table description element is like alt text for a table (along with the table caption element). It seems like we should include this somewhere, but I’m not 100% sure how – I’m pairing it with the simple caption for the moment. (Should it maybe go in the block caption instead?) * Detect table captions - Check for caption paragraph style /and/ either the simple or complex table field. This means the caption detection fails for captions which don’t contain a field, as in an example doc I added as a test. However, I think it’s better to be too conservative: a missed table caption will still show up as a paragraph next to the table, whereas if I incorrectly classify something else as a table caption it could cause havoc by pairing it up with a table it’s not at all related to, or dropping it entirely. * Update tests and add new ones
I noticed this wasn’t done yet, and it was a simple change. Note the column widths (which are doubles) are now displayed in the native for the tests: is that ok? I wonder about rounding errors. Would it be better to use a `Rational` here instead (especially since I’m creating them by division)?
Ok, I believe this is ready to be merged. I've done what I listed in my previous message, with the exception that there's only partial support for block-level content within captions: putting an ordered list in the caption in Word creates a new paragraph, for example, which is then not included in the caption. However, I think it's an improvement over the current situation, so I think it's worth merging as-is while I work on the other parts. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wonderful work!
We're still getting some CI failures here. |
return (Just c) | ||
[] -> return Nothing | ||
let shortCaption = if T.null cap then Nothing else Just (toList (text cap)) | ||
cap' = caption shortCaption (fromMaybe mempty fullCaption) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make sense to default to plain (text cap)
instead of mempty
for the full caption?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would that not duplicate the caption text (as it's going in the short caption as well)? I don't have the clearest idea of when the full caption or short caption is used.
Any objections if I cherry-pick and amend the commits, so we can get this into the next release? |
I'll fix these now: I think it just requires updating the golden docx documents to match the test output. |
No objections here, go for it :) |
Could you check that this is ok? https://github.com/jgm/pandoc/compare/master...tarleb:docx-reader-new-table-features?expand=1 |
Sure, I'll take a look now. |
Looks good! I actually can't see what you're saying about the caption field though: mind pointing me to the specific bit? |
Sorry, I forgot to push that change. It's here now: https://github.com/jgm/pandoc/compare/master...tarleb:docx-reader-new-table-features?expand=1#diff-22df782fded6b5f41521820cbf8de809ecd9835ce95f74fa2ca4eb49e0b0831dR675 |
Looks good, thanks :) |
@tarleb Will I close this PR then? I plan to work on more of the docx table features over the next while, but it's probably easier to put them in a new PR rather than this one. |
I saw it automatically closed #6316: we'll probably want to re-open that as some of the table features are not yet supported. |
I haven't done many of the features yet, but I figure it's good to get something up early to show what I'm doing (and in case I'm heading in wildly the wrong direction).
Will close #6316