Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HTML reader - support new table features #6312

Closed
jgm opened this issue Apr 24, 2020 · 13 comments
Closed

HTML reader - support new table features #6312

jgm opened this issue Apr 24, 2020 · 13 comments
Assignees

Comments

@jgm
Copy link
Owner

jgm commented Apr 24, 2020

Add support for new table features introduced in
jgm/pandoc-types#66
including table attributes (including identifier), rowspan, colspan, table head and foot, multiple header lines, row headers, captions that allow block-level content and include an optional short caption.

@danlobo02
Copy link

subscribing

@jgm
Copy link
Owner Author

jgm commented Aug 24, 2020

@danlobo02 - note that you can subscribe to issues using the Subscribe button on the right; you don't need to make a comment.

jgm pushed a commit that referenced this issue Sep 10, 2020
* HTML writer: add support for row headers, colspans, rowspans
* Add planet table tests

See #6312
@tarleb tarleb self-assigned this Nov 23, 2020
@tarleb
Copy link
Collaborator

tarleb commented Nov 23, 2020

I'd be happy to get some opinions on a problem I'm facing: Cell alignment cannot (or should not) be specified for all cells in a column, as the align attribute has been deprecated for <col> elements. So each cell must specify its own alignment. However, other formats don't support per-cell alignments, and expect column-wise alignment info. Should we try to detect column alignments in the HTML reader?

Put differently, given this table

<table>
  <tr><td style="text-align: left">1</td><td style="text-align: right">2</td></tr>
  <tr><td style="text-align: left">3</td><td style="text-align: right">4</td></tr>
</table>

What should the Markdown output be?

  --- ---
  1     2
  3     4
  --- ---

or

  --- ---
  1   2
  3   4
  --- ---

I lean towards the first version, as that is also what pandoc produces currently. But I believe that, in the presence of colspans and rowspans, the first version will require table normalization similar to what we do in the HTML writer, which might impact performance.

Also, where should this computation happen? In the writer, so filters have access to the original information? Or in the reader, because then we can provide more guarantees for the well-formedness of tables?

@jgm
Copy link
Owner Author

jgm commented Nov 23, 2020

Yes, I agree, the first version.
By normalization, do you mean moving from a representation where the alignments are on each cell to one where the cell alignments are default and an alignment is specified on the column?
It seems to me that it would make most sense to do this in the reader, though I'm open to other ideas.

@jgm
Copy link
Owner Author

jgm commented Nov 23, 2020

@despresc may also have feedback on this.

@ickc
Copy link
Contributor

ickc commented Nov 23, 2020

If I understand this question correctly, I think the reader should try its best to represent the input in the pandoc AST, and in this case means it is representing the align per cell, since the AST is capable to do that.

I don’t know much about the normalization already performed in pandoc, in this particular case, is it just checking if the whole column is sharing the same align and if it did then “normalize” it such that there’s no cell alignment but one single column alignment for that column? (And repeat for each column.) I think this kind of normalization probably shouldn’t hurt the performance a lot.

By the way I’m working on a library and pandoc filters that essentially can read and write all pandoc table AST feature so my biased preference would be an AST that has as much information from the reader as possible.

@despresc
Copy link
Contributor

despresc commented Nov 24, 2020

Setting the column alignment like that seems fine to me. If that is done here, it should probably be after the table builder has done its work. Otherwise it could be done in the table builder itself, at the cost of bumping the pandoc-types version and introducing a little unnecessary work in the readers of other formats.

Since AlignDefault is supposed to mean "no alignment preference", I think, should the rule be relaxed to include cases where a column has a mix of AlignDefault cells and cells of some common explicit alignment?

I assume that cells that span multiple columns will not influence this process at all, since their alignment tends to be unrelated to the alignment of the columns they span.

It will be slightly easier to keep the alignments on cells in the affected columns than to remove them, since the entire column needs to be laid out before we can decide whether or not its cells share the same alignment. The writers for formats that have both column and cell alignments already have to deal with superfluous explicit cell alignments, so keeping the alignments shouldn't introduce problems that aren't already present.

@tarleb
Copy link
Collaborator

tarleb commented Nov 24, 2020

Thanks for the feedback! I preserved the old behavior for now (i.e., take column-wide alignments from cells in first row), but plan to improve that later.

@tarleb
Copy link
Collaborator

tarleb commented Nov 26, 2020

Implemented:

  • Multi-row heads
  • Multi-body tables
  • Proper table footers
  • Attributes on thead, tbody, tfoot, and table cells

Missing:

  • support for body header rows
  • support for row header cells
  • attributes on tr elements

@tarleb tarleb closed this as completed in a9c7662 Nov 27, 2020
@tarleb
Copy link
Collaborator

tarleb commented Nov 27, 2020

Done. One final question: if there is no <thead> element, then the reader used the first body row as header iff it contained only <th> cells. I've kept this behavior for backwards compatibility, but maybe it would better to make a clean slate and do away with special cases?

@jgm
Copy link
Owner Author

jgm commented Nov 27, 2020

I think it's good to keep backwards compatibility here -- at least until we have a specific reason to do otherwise.

@jgm
Copy link
Owner Author

jgm commented Nov 27, 2020

Do we have good test coverage for all these features?

@tarleb
Copy link
Collaborator

tarleb commented Nov 27, 2020

Most features are tested with what's already there in other tests, mostly in test/html-reader.{native,html}. The only exceptions are colspans/rowspans, which I wanted to add to these files but forgot. I will do so later.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants