Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Crazy ideas: table structure #86

Open
tarleb opened this issue Nov 24, 2020 · 25 comments
Open

Crazy ideas: table structure #86

tarleb opened this issue Nov 24, 2020 · 25 comments

Comments

@tarleb
Copy link
Contributor

tarleb commented Nov 24, 2020

Here are a few thoughts I had while implementing table features in pandoc readers and writers. None of the suggestions have been thought through, so take the below with a grain of salt.

  • Promote RowHeadColumns from TableBody to the full table. The number of row heads will typically be constant for all table bodies; it is also relevant for, and should also apply to, the table head and foot.

  • Use grid-based data structure. Going from a grid structure to a list of rows and cells seems much simpler than the other way around. Most writers need to compute the table grid, so it might as well be the main structure. This could provide additional type-level guarantees and make it easier to access cells column-wise, e.g. when checking for the most frequent cell alignment in a column.
    A possible data structure would be Array from package array. It is already a transitive dependency of pandoc-types (through deepseq) and described in the Haskell 2010 report.

  • HTML limits rowspan to a max value of 65534 (= 2¹⁶ - 2), and it would be reasonable to adopt this limit. RowSpan could be then be a newtype wrapper for Word16.

  • The colspan attribute is limited to a max value of 1000 in HTML. Like for rowspan, this seems like a reasonable limit. ColSpan could also wrap Word16.

Cc: @despresc

@despresc
Copy link
Contributor

I'm not very familiar with the use of multi-body tables, but the examples I've seen do normally have constant-width row heads. So RowHeadColumns could be made to apply to the entire table. We could also enforce the boundaries of the row head in the table head and foot, since normally the overall head and foot of the table won't have cells that stretch across that boundary. It still shouldn't apply to the intermediate heads, as I discussed here.


Having the span types be a wrapper around Word16 is fine, though it will make working with them slightly more inconvenient, since the other table dimensions are still Int-based. It seems unlikely that anyone will want tables with spans larger than the HTML limits, so those exact limits could be enforced by the table builder too.


At the moment all the writers assume that they may encounter any possible Table, including ones that aren't normalized, since every Table is considered valid internally. If that's still a desirable assumption, then from perspective of the writers the extra cached information in Table itself would be useless: they couldn't trust it to be consistent. In that case having the grid table construction in a module in pandoc would be better.

The alternative is to give the Table type clear invariants that must not be violated, making normalized tables the only valid tables internally. If we did that, then we could cache as much information as we like in Table and have the writers simply trust that that information is accurate. This would save pandoc from having to normalize tables twice, at the cost of us having to be very careful not to construct ill-formed tables internally. The FromJSON and Peekable instances would need to normalize the tables that they read, for instance.

Other thoughts:

  • If the representation were switched to something like Array, then it would be convenient for the readers to have a function that constructs such a table from the current representation. Otherwise they would have to calculate the exact grid representation (including cell coordinates and occupied grid positions) themselves. That function and its inverse would probably also be the easiest way of ensuring a grid table is well-formed, by converting it to the current representation, normalizing it, then converting it back.

  • Traversals of tables might also be a bit slower, since they have to construct entire new arrays when old ones are modified. I'm not familiar with the performance characteristics of Array-like structures, so this might be avoidable, or might not be a problem in practice.

@despresc
Copy link
Contributor

On that topic, we could get rid of the double-normalization that currently happens by simply removing that behaviour from the table builder. The defensive normalization that the writers now do (through toLegacyTable and AnnotatedTable.toTable) would then ensure that the writer output is well-formed, which is the point of the normalization that's done now. That would mean that pandoc and external filters would encounter odd-looking internal tables more frequently than they do now, if that's a concern.

@sergiocorreia
Copy link

From a filter perspective, I like a lot the power of the new tables but also dread a bit the complexity for those implementing the filters.

For instance, it feels a bit inconsistent that, within a TableBody, you use the RowHeadColumns integers to indicate headers on the left but have to create a separate list of rows to indicate headers at the top.

When walking over a tree, the structure of a table is also quite different from those of other elements. With most elements, you just traverse the children nodes directly, and each element only has one type of children. However, with tables, you need to traverse the main tree (with multiple TableBody elements) and also need to traverse the head and the foot of the table. Maybe it would be simpler to say that a Table can contain three types of elements, TableHead, TableBody, and TableFoot, with the constraint that TableHead must be the first element and TableFoot must be the last element?

With these two changes (adding a HeadRows to TableBody, and moving TableHead TableBody and TableFoot to the same list), a Table would be much easier to iterate, as every table would just have three types of children, which in turn will have lists of Rows, which in turn just have Cells.

(I'm not necessarily saying we should go that way, just thinking aloud in line with the title of this issue :) )

@ickc
Copy link
Contributor

ickc commented Nov 25, 2020

I'll think out loud for a moment, with the caveat that I don't know much about the characteristics in Haskell programming and am thinking more in the perspective of pandoc filter writer, basically mainly from my pet project pantable written in Python.

First, the new table AST is a great thing to happen. It is long overdue. But at the same time it is a very disruptive change. I think any filter involved in manipulating pandoc tables are broken by this. In the case of pantable, I took advantage of this change to rewrite it completely as its new features really worth the efforts.

But what I'm saying is may be people writing filter for pandoc table are still playing catch-up to this new AST, a yet another unnecessary change may be unwelcomed.

Eg the proposed RowHeadColumns would make it becoming more restrictive. Not that the restriction is that restricting in practice, but it is still a regression. (But I understand the point about head and foot.)

About the grid data structure, I think the current AST structure that's more HTML like is fine. I think having a grid structure as an alternative would be helpful. But I think replacing it by that would be an unnecessary change. In fact, in the work in progress pantable, the internal structure is exactly a grid and it basically provide a way you can jump losslessly between it (grid) and panflute AST (basically pandoc like AST).

So what I'm saying is probably you should have both, and lazily convert between both on demand within Haskell. (This is speaking from some OOP thinking if that makes sense... Eg if the reader already lay things on a grid, then it has a grid representation already, if the writer requires a grid, then no conversion, else convert and perhaps normalized once here...) For JSON filter, may be it is good to have one single representation only for consistency, and let the filter framework do our own grid and normalization (which is what I'm doing in pantable). Or may be you can allow both and let the filter user choose (Eg an arg from pandoc to control this.)

@jgm
Copy link
Owner

jgm commented Nov 28, 2020

Just catching up on this thread.

I suppose that since we're already in a period of churn with the table types, further changes might be less disruptive now than if we waited til everything worked again. Especially if the changes made it easier to process tables.

I'm curious what the grid representation of tables would look like in a Haskell type?

The suggestion to treat row headers like column headers (using an Int) seems a great one. @despresc - are there reasons not to do it this way?

@despresc
Copy link
Contributor

Do you mean indicating the number of rows in the head of a TableBody with an Int? It seemed like separating the head rows from the body rows would be clearer, since they're different kinds of rows. It's like representing a pair of lists as ([a], [a]) versus (Int, [a]). I can't think of other reasons, so if it's clearer or easier to work with in practice then the Int way is fine.

@jgm
Copy link
Owner

jgm commented Nov 29, 2020

I'm not sure how much easier it will make things. We should think about that before making a change such as this (which would require a lot of changes throughout the code base). Maybe @sergiocorreia could elaborate on that. But it's hard to see a good rationale for using two lists for row headers but using an Int for column headers.

@sergiocorreia
Copy link

Do you mean indicating the number of rows in the head of a TableBody with an Int?

Yes! Some arguments in favor are:

  1. TableHead, TableBody, and TableFoot would then be very similar to each other, with the only difference being two extra properties in TableBody (num_head_columns, num_head_rows, or however we name them).

This then reduces the mental overhead of manipulating tables. For instance, converting a TableFoot to a TableBody would be trivial.

  1. Headers within a TableBody are rare enough that most users would then be able to ignore them unless they need to use them, and even then they would just need to specify an integer.

  2. Even if they are different types of rows, in most document types it's not the rows that are treated differently but the cells (e.g. in HTML they are all <tr> objects). This would also simplify writers as there would only be one way to deal with header column/rows (think of the HTML writer, where header rows and columns are saved as <th> instead of <td>).

On the other hand, this might make it a bit harder to create filters that apply to table body heads. Right now, in panflute this would involve:

def action(e, doc):
    if isinstance(e, TableRow) and isinstance(e.parent, TableBody) and e.location=='head':
        pass # do something

With the change, the code would be:

def action(e, doc):
    if isinstance(e, TableRow) and isinstance(e.parent, TableBody) and (e.index < e.parent.num_head_rows):
        pass # do something

@ickc
Copy link
Contributor

ickc commented Nov 29, 2020

Just trying to talk about the few things that is difficult to handle in the current AST, with the mindset of implementing a one-one correspondence of the pandoc table AST to a grid structure in pantable. (Below I used a terminology to distinguishing things and it would be great if there's an official terminology associated with them. Whatever is in double-quote is my made-up terminology. ICA is (Id, class, attr). I called the current AST "tree-like representation" and the proposed one "grid-like representation".)

  • "row-block" related: this gives the most pain
    • in rows, we have head, body, foot, and within body, we have another head. So there's 4 kinds of "row-blocks", head, "head of body", "body of body", foot.
    • only head, body, foot has "row-blocks ICA", i.e. only 3 kinds of "row-blocks" are discerned here, not "head of body", "body of body"
    • only body has RowHeadColumns
  • align: there are 2 places for allignment, in the "spec" where the column alignment is with the col-width. And in the cell.
  • caption: short-caption only accept Inline and caption accepts Block.
  • a lot of different places for ICA, per cell, per row, per row-block, per table
  • row and columns are fundamentally asymetric here. For example if you transpose the body, if RowHeadColumns are different between different bodies, it is unsure what its transpose should be. (The same can be said if there's more than a head and a "body of body" however, so a moot point.)

From my experience of implementing a grid representation of the pandoc table AST in Python, while the above things are annoying, it becomes easy once the grid structure is set in-place. I hold the view that there's no best representation here. (Especially with span, because without span and tree representation is the grid representation.) The default tree representation in pandoc is more suitable if the input/output formats are also tree like, such as HTML and LaTeX. (My guess is that the HTML and LaTeX writers first supporting table span is not a coincidence, but because it is a more natural translation.)

That's why I recommend you have 2 representations, a grid one and a tree one. For some intput/output, if the grid one is easier to work with, put it in a grid representation. The grid representation and the tree representation should be one-one correspondance so there's no problem in translating back and forth.

And for performance, I think it is debatable if a grid-only approach would be faster. I would expect for some input-output pairs such as the HTML the tree representation is more natural, and hence fastest to work with them alone. Also I am not sure translating the document into a grid structure is (significantly) faster than put it in a tree than a grid.

About RowHeadColumns:

  • even if there's no RowHeadColumns per body, a body cannot be easily converted to a head or foot, because a body has a concept of "head of body" and "body of body".
  • I'd argue if you want to redefine where are the "row-blocks", it is much easier to put it on a grid first and work from there. At least in the case of pantable it is trivial to do that.
  • replacing RowHeadColumns per body to one single no. is more restrictive. It is hard to convince why that is a necessary change and has no obvious advantage.

@kysko
Copy link

kysko commented Nov 30, 2020

Just made aware of this discussion by an allusion in pandoc-discuss... Don't know Haskell, but here's a few comments... and other crazy ideas!

sergiocorreia:

Maybe it would be simpler to say that a Table can contain three types of elements, TableHead, TableBody, and TableFoot, with the constraint that TableHead must be the first element and TableFoot must be the last element?

Been working/playing on and off on a Lua port of docutils' grid_table parser in the past few weeks for a Lua filter, I can echo this impression.

In fact, I'd go a bit further and say they could be viewed of the same TableBody kind, which can have a (sub)head and body. The TableHead would be special in being the first with a (sub)head but no body (and no rhc); the TableFoot could be special in being the last, with a (sub)head with a single empty row (not nil), and a body (and no rhc).
Better yet, since TableBody has an ICA (to take ickc's terminology), the information could be placed as an attribute rather than playing with (sub)head as above.

[In other words (trying to be clearer), if a Table has no TableHead nor TableFoot, you can put enough information in the first TableBody and last TableBody so that a writer -- or intermediate filter -- could know they can be used as head and foot.]

Of course it's a bit of a hack to think this way, and perhaps there's a need for a more definite identity (and restrictions) for head and foot, but it was easier for intermediate solutions before translating to pandoc internal structure.

tarleb:

Use grid-based data structure. Going from a grid structure to a list of rows and cells seems much simpler than the other way around.

Since all cells now have a ICA, wouldn't it be sufficient that pandoc write the "coordinates" in cells' attributes in the current scheme?
That way a writer could have some clue where a cell's column really is, and discard those attributes after their purpose. Again a bit of a hack of course...

@ickc
Copy link
Contributor

ickc commented Nov 30, 2020

Just made aware of this discussion by an allusion in pandoc-discuss...

I think we actually should discuss this on pandoc-discuss first, especially change like this is not about adding a feature, but how best to represent table. Signal-to-noise might be lower in pandoc-discuss, but since this change is going to affect a lot of people with no obvious gain it might be better discussed there first. At the very least we can have a community-wise consensus on the last change that is needed regarding the table.

In fact, I'd go a bit further and say they could be viewed of the same TableBody kind

In pantable, by putting the cells on a grid, I don't distinguish the head, "head of body", "body of body", foot in the table itself. I keep a list of int like this: [1, 0, 9, 0], which indicates there's 1 row of head, 0 row of "head of body", 9 rows of "body of body", 0 row of foot. It will get longer as there's multiple bodies. This approach does have its quirks as inferring which "row-block" a row belongs, or the "n-th body" it belongs to, will be harder. But they are doable (just crunching numbers.)

wouldn't it be sufficient that pandoc write the "coordinates" in cells' attributes in the current scheme?

Going back and forth between table and cells will be difficult here. To find a cell you need to walk the whole table in this case. e.g. normalizing tables probably need to do this. So a grid representation is still advantageous in some operations.


One of the premise in this issue is that "Going from a grid structure to a list of rows and cells seems much simpler than the other way around." But I think you probably just pushed "going from 'tree-representation' to 'grid-representation'" earlier in the stage of reader. In cases that the reading format is also "tree-like", such as HTML/LaTeX, may be there's no gain?

@kysko
Copy link

kysko commented Dec 1, 2020

we actually should discuss this on pandoc-discuss first

ook, was just surprised about the news. I'll keep the 'noise' to a minimum 😉.

Going back and forth between table and cells will be difficult here. To find a cell you need to walk the whole table in this case.

True, was not thinking about going "back and forth", only one way.
In any case... a quick look at AnnotatedTable show that it seems to offer some kind of grid mapping (and more), at least if I interpret the code comments correctly! I should have checked before my initial suggestion.

@ickc
Copy link
Contributor

ickc commented Dec 2, 2020

I don't mean that, I mean the topic of this issue might better be discussed in pandoc-discuss first, I don't mean your comment in particular. I meant if we discussed there the signal to noise might be lower but it probably can get more attention from people that would be affected by this kind of change most. It's a bit unusual, but AST change as big as this is unusual. And if we have to change it a 2nd time then we better discuss it more and nailed it.

@tarleb
Copy link
Contributor Author

tarleb commented Dec 14, 2020

I'm curious what the grid representation of tables would look like in a Haskell type?

This is roughly what I imagine:

-- | A grid cell contains either a real table cell, or is the
-- continuation of a column or row-spanning cell. In the latter case,
-- the index of the continued cell is provided.
data GridCell
  = ContentCell Attr Alignment RowSpan ColSpan [Block]
  | ContinuationCell RowIndex ColIndex

newtype RowIndex = RowIndex Int deriving (Eq, Ix, Ord)
newtype ColIndex = ColIndex Int deriving (Eq, Ix, Ord)

-- | Cells are placed on a grid. Row attributes are stored in a separate array.
data CellGrid = CellGrid
  { cellArray :: Array (RowIndex,ColIndex) GridCell
  , rowAttrs  :: Array RowIndex Attr
  }

data GridTable = GridTable
  { tableAttr     :: Attr
  , tableCaption  :: Caption
  , tableColSpecs :: Array ColIndex ColSpec
  , tableHead     :: CellGrid
  , tableBodies   :: [CellGrid]
  , tableFoot     :: CellGrid
  , tableRowHeads :: Int
  }

Or the table could be one large grid, and the different parts would be given as row index ranges:

data GridTable = GridTable
  { tableAttr     :: Attr
  , tableCaption  :: Caption
  , tableColSpecs :: Array ColIndex ColSpec
  , tableGrid     :: Array (RowIndex, ColIndex) GridCell
  , tableHead     :: RowRange
  , tableBodies   :: [(RowRange, RowRange)] -- (body header, real body)
  , tableFoot     :: RowRange
  , tableRowHeads :: ColRange
  }

@tarleb
Copy link
Contributor Author

tarleb commented Apr 27, 2021

FWIW, the grid representation would be really useful to have for docx tables, where rowspans are handled by adding "merge cells" in subsequent rows.

@jgm
Copy link
Owner

jgm commented Apr 27, 2021

Can you remind me: is the idea to change the main table type to a grid representation, or to add a GridTable type and conversion functions from/to the Table we have now? (In the latter case, these wouldn't necessarily need to go in pandoc-types; they could go in pandoc itself.)

@tarleb
Copy link
Contributor Author

tarleb commented Apr 28, 2021

The original idea was to update the main table type, but it's probably too late for that now. Maybe it's just my way of thinking, but I realized that I find grids the most natural way to think about tables.

I may try to write something akin to a GridTable for the docx writer, or maybe modify the AnnotatedTable, and we can see where that leads us.

@jgm
Copy link
Owner

jgm commented Apr 28, 2021

I wouldn't rule out changing the table type again, if the grid model is much easier to deal with. But that's going to be a lot of work. Maybe a good first step would be to implement the type in an auxiliary model in pandoc, together with conversion functions.

@ickc
Copy link
Contributor

ickc commented Apr 30, 2021

Grid model has its own problems but I agree it seems easier to work with. I agree with @jgm in the direction of

  1. create a grid-table in addition to the current tree-table
  2. has lossless functions converting between them (lossless in valid data only)
  3. then in the long run see which one you uses more, and if needed, change the pandoc-type to grid-table and rely on those lossless functions for those that needs tree-table
    • this probably can be done after a long time since the new table AST, but it'd be great if a tool such as ast-migrator can deal with this lossless conversion in the filter level. (Or better yet, a flag that change how pandoc writes the AST, i.e. you have both AST representation of the same thing that can be losslessly converted to one another. e.g. the filter can call pandoc to convert from what it receives to the AST that it understands.)

@tarleb
Copy link
Contributor Author

tarleb commented May 1, 2021

I have got to admit that I lost a good bit of enthusiasm for grids after using them for the docx writer. I still think they are the right data structure for that specific case, but I forgot just how wrong it feels to use unsafe operations like array indexing in Haskell. We'd have to provide a number of good helper functions to safely use arrays throughout the code base. I currently tend towards using grids as intermediate structures only.

@jgm
Copy link
Owner

jgm commented May 3, 2021

We should completely avoid unsafe partial functions like (!!) in the code -- though it's easy enough to create safe wrappers as is done in the safe package.

@emdash-ie
Copy link

Smaller point that came to mind while I was implementing column width reading in the docx reader: would it be better to have the column widths as Rational rather than Double? At least in the docx reader I was creating them by division, and it looked a bit odd having the full doubles in the native. I guess any rounding errors are going to be quite small, so maybe it doesn't matter?

@jgm
Copy link
Owner

jgm commented May 28, 2021

I wish I'd used Rational rather than Double from the beginning. That is indeed a better choice. But maybe not worth fixing now, until the next major revision of pandoc-types.

@ickc
Copy link
Contributor

ickc commented May 30, 2021

I agree rational is a good choice. But what would that becomes in JSON AST? A float or specially formatted string?

[pantable (pandoc table ast in Python) allows rational ratio optionally. Also the auto width calculated from say a markdown table is already rational.]

@jgm
Copy link
Owner

jgm commented May 30, 2021

I suppose the JSON representation could just be a string, yes. That might make it a bit more difficult for consumers of the AST other than pandoc.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants