ODT and figure/table numbering #5474

Closed
lierdakil opened this issue May 1, 2019 · 25 comments

@lierdakil
Contributor

ODT now apparently forcibly adds "Figure " and "Table " to figures and tables. While cool and all, this is an issue for me. For one, there seems to be no way to disable this behaviour (or am I missing something?). For two, as far as I can tell, no other format except LaTeX does this, which seems to go against the "write once export everywhere" idea. For three, pandoc-crossref handles numbering and cross-referencing arguably better, but having this thingamajig in ODT breaks everything (because pandoc-crossref can't know about what ODT writer does, and vice versa)

So, I have a couple questions to ask.

First: Why is it there in the first place, considering it's virtually nowhere else. Frankly feels tacked-on and weirdly inconsistent with the rest of pandoc.
Second: Do you suppose a way to explicitly enable/disable the feature would be in order?
Third: Do you think this option should be disabled or enabled by default? I strongly lean to "disabled by default" for the sake of consistency with docx/html/etc.

Thanks.

P.S. I created a thread on pandoc-discuss a few days ago, but not much's going on there.

P.P.S. Side note, I've run into another weird inconsistency with the ODT writer wrt figures in table cells -- there just aren't any. If a table cell contains only a Para, it's unconditionally turned into a paragraph, even if it should by all accounts be a figure. Should I create another issue/PR for that?

@jgm
Owner

jgm commented May 1, 2019

This was done in #4944 -- see that issue for motivation. cc'ing @pyssling

I agree that there's an issue about consistency; this could be handled, of course, by adding the translated Figure or Table names and numbers in the other output formats as well (e.g., HTML). Doing that would make output more consistent between PDF and the other output formats.

The bigger issue is the bad interaction with pandoc-crossref, which I unfortunately hadn't considered in merging that PR. Although ultimately I'd like to have good support for cross-references in pandoc itself, this addition doesn't give you enough for cross-refs in ODT, so pandoc-crossref is still needed and can't be used because of this change. I agree that this is a major issue.

Some possible options:

  1. Add syntax extensions number_figures and number_tables (or maybe number_figures_and_tables) that turn this behavior on in output formats that support it.

  2. Add a command-line flag that does this, maybe --number-figures-and-tables?

I think (2) is more consistent with existing behavior (e.g. --atx-headers option, or --toc).

@jgm added this to the 2.7.3 milestone May 1, 2019

@lierdakil
Contributor Author

IIRC, syntax extensions almost exclusively relate to Markdown syntax, so unless you want to somehow rework those to be reader/writer-dependent, I don't think using syntax extensions makes that much sense. That said, after a quick look at the code base, it seems that syntax extensions are slowly creeping into other readers/writers, so this might be a good idea anyway in the long run -- not entirely sure about that.

(2) is certainly more straightforward.

@pyssling
Contributor

pyssling commented May 1, 2019

To answer the First question:
The consistency I was trying to achieve was with what Libreoffice itself generates for the user when inserting figures/tables. This isn't just stylistic unfortunately, but also what Libreoffice uses when parsing the document to create 'Table of Figures' and 'Table of Tables'. Without the numbering it simply ignores figures and tables when generating these indexes.

Hope that clears up what I was trying to accomplish.

I wouldn't mind having options to control these, but it's good to be aware that not enabling them breaks Libreoffice expectations, which should be documented for poor users (of which I was once one) who can't figure out why Libreoffice doesn't include their tables and figures in its indexes.

I also had a quick look at pandoc-crossref, and it's worth mentioning that it is very much latex aware. Making it ODT aware might not be that bad?

@scottmartincampbell

A comment on this issue: it would be helpful if adding "Figure X" was optional for ODT, as with the (2) suggestion above. In particular, org-mode already includes "Figure X" when exporting to ODT, so I end up with "Figure 1: Figure 1: [caption]", and I can't currently see a way to get pandoc or orgmode to not do this, other than downgrade pandoc.

@jgm
Owner

jgm commented May 2, 2019

Maybe better to have two options, e.g. --figure-numbers and --table-numbers?

@lierdakil
Contributor Author

@pyssling

> create 'Table of Figures' and 'Table of Tables'
> not enabling them breaks Libreoffice expectations

IMO, that's not really a reasonable expectation that whatever pandoc writes into any output format would play all that nice with that format's internal references (which list of figures/tables basically is). I mean, I don't mind hacking those in, but I really don't think that's pandoc's "primary mission" so to speak.

I will concede that LaTeX is a bit of a special case, sure. But LaTeX is 100% hackable via custom pandoc templates if needed. This is not the case with ODT, so my first expectation is to get the least-embellished human-readable output possible while keeping semantic information. At least by default.

> pandoc-crossref [...] it is very much latex aware.

Actually, I've been trying to get away from that. Turns out, reimplementing non-negligible parts of pandoc's LaTeX writer in a filter is not a sustainable strategy in the long run. Basically currently pandoc-crossref is two filters disguised as one; "LaTeX mode" can't do everything "normal mode" can, but in some particular cases it, conversely, works better than "normal mode". It's all rather painful in practice, one basically has to prepare two different documents, one for LaTeX and another for everything else. The next major release won't have much of that mess left.

> Making it ODT aware

For one, see above: hacking direct format support into a filter equals reimplementing writers. Not really sustainable.

For two, while I'm reasonably proficient in LaTeX to hack-in some raw LaTeX blocks where needed, I can't say I could do the same for XML-based formats like ODT or docx without spending weeks untangling their specs. Furthermore, IIRC, counters and especially references in those formats (well, certainly docx) are considerably more complicated than could be achieved with simply inserting raw blocks (which is the limit of what a filter can do).

So, no, unless someone else is willing to write and maintain all the ODT-specific code in pandoc-crossref, this is not going to happen, even if technically feasible, which I'm not sure it is.

@lierdakil
Contributor Author

lierdakil commented May 2, 2019

@jgm

> Maybe better to have two options, e.g. --figure-numbers and --table-numbers?

As a reminder, we also have code blocks. Which one could also reasonably want to be numbered. So this potentially turns into at least three separate options already. And if we also at some point decide to treat display math in a similar way, this all gets a bit overwhelming, don't you think?

An idea that comes to mind is to have the ability to specify which counters to enable in the writer as arguments to the option, instead of having multiple options. For example (hope someone thinks of a better name): --writer-numbering=figure,table, with the short version --writer-numbering meaning "number everything". Can't say off the top of my head if that would be easy to implement in pandoc's argument parser, but it should at least be doable, and this feels considerably more elegant than having N separate options controlling basically the same thing.
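To make the proposed option shape concrete, here is a minimal sketch in Python using argparse. It is only an illustration of the idea: --writer-numbering is the name floated in this comment and is not an existing pandoc flag, and the counter names in KNOWN_COUNTERS are assumptions chosen for the example.

```python
import argparse

# Hypothetical counter names; the thread only mentions figures, tables,
# code blocks and display math as candidates.
KNOWN_COUNTERS = {"figure", "table", "codeblock", "equation"}


def parse_numbering(argv):
    """Return the set of counters the writer should number.

    No flag            -> nothing is numbered (the "disabled by default" position)
    --writer-numbering -> every known counter
    --writer-numbering=figure,table -> only the listed counters
    """
    parser = argparse.ArgumentParser()
    # nargs="?" lets the flag appear with or without a value;
    # const is used when the flag is given bare.
    parser.add_argument("--writer-numbering", nargs="?", const="all",
                        default=None, metavar="COUNTERS")
    args = parser.parse_args(argv)

    value = args.writer_numbering
    if value is None:
        return set()
    if value == "all":
        return set(KNOWN_COUNTERS)

    requested = {name.strip() for name in value.split(",") if name.strip()}
    unknown = requested - KNOWN_COUNTERS
    if unknown:
        parser.error("unknown counter(s): " + ", ".join(sorted(unknown)))
    return requested
```

A single option with an optional comma-separated value keeps the interface at one flag no matter how many counters are added later, which is the point being argued here.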

P.S. Side note, I would like to see these options also affect LaTeX/PDF output using the default template. FWIW, implementation should be rather straightforward: set some template variables and add a couple lines boiling down to \captionsetup{labelformat=empty} to the default template.

@pyssling
Contributor

pyssling commented May 2, 2019

@lierdakil

> IMO, that's not really a reasonable expectation that whatever pandoc writes into any output format would play all that nice with that format's internal references (which list of figures/tables basically is). I mean, I don't mind hacking those in, but I really don't think that's pandoc's "primary mission" so to speak.

In my not so humble opinion I'm asking myself: If pandoc's mission is NOT to play nice with a format's internal references, then what is pandoc's mission? @jgm, any thoughts? I sort of naively assumed we were trying to do the best we can for any given output format.

> I will concede that LaTeX is a bit of a special case, sure. But LaTeX is 100% hackable via custom pandoc templates if needed. This is not the case with ODT, so my first expectation is to get the least-embellished human-readable output possible while keeping semantic information. At least by default.

ODT is very hackable via custom templates, and it would be quite simple to make it even more so. You can almost overwrite any style using these. With a little more work and post-processing you can actually achieve extremely solid professional output, such as technical manuals.

> For two, while I'm reasonably proficient in LaTeX to hack-in some raw LaTeX blocks where needed, I can't say I could do the same for XML-based formats like ODT or docx without spending weeks untangling their specs. Furthermore, IIRC, counters and especially references in those formats (well, certainly docx) are considerably more complicated than could be achieved with simply inserting raw blocks (which is the limit of what a filter can do).

I understand that. I haven't looked at pandoc-crossref before. Our toolchain is based on asciidoc, which is transformed into Docbook using asciidoctor, which pandoc can then nicely turn into ODT with an advanced stylesheet. I then call libreoffice in batch mode to update table of contents and other indexes which were inserted using the template.

I don't mind making figure and table numbering optional, we anyway call pandoc through a wrapper. However, I don't think it makes sense to disable them by default if pandoc's mission is to provide as useful output as possible to end-users who aren't using additional tooling.

Finally, for me it sounds like pandoc-crossref should work on integrating itself into pandoc if it is painful to duplicate output code for latex for example. But that's just an uninformed opinion. :-)

@lierdakil
Contributor Author

@pyssling

> best we can for any given output format

This would be wildly inconsistent across different output formats, sadly. Doesn't necessarily mean that output has to be brought to the lowest common denominator, but getting overly fancy isn't really something I'd do lightly either, at least not by default.

In my opinion, pandoc is at its greatest when you need to write a thing and then turn it into multiple output formats. Having different output formats behave at least somewhat consistently helps with this use-case a lot. For a single output format, I'd personally rather just write the thing in that format to begin with (except perhaps docx, because as a Linux user, I hate Word with a passion).

> ODT is very hackable via custom templates

Yet there's no way to add/remove those fancy "Figure N" things with just those, apparently. That's what I meant by LaTeX being 100% hackable: it's possible to change almost anything by simply adding some commands to prelude. Certainly easy to change figure captions in any reasonable way.

> I don't think it makes sense to disable them by default

Here are a few reasons besides consistency with most other output formats:

It would be extremely inconvenient for pandoc-crossref users targeting ODT if it was enabled by default.

@scottmartincampbell above makes another case where having it enabled by default leads to at least surprising and at worst annoying and confusing results.

Another reason to disable those by default is that re-reading ODT produced by the current writer yields less-than-stellar results. Namely, "Figure N" gets read verbatim, which, if nothing else, is rather surprising behaviour when, e.g., you need to round-trip a document to ODT and back for whatever reason. And I'm not sure a simple fix exists here, since "Figure N" is just good ol' text with little semantic markup in ODT. That said, ODT reader is frankly anaemic, so it's not like this is the only problem it has, but still.

> pandoc-crossref should work on integrating itself into pandoc

Granted. But that in itself would be extremely painful. Pandoc-crossref makes quite a few compromises to make things work, which wouldn't really be acceptable in pandoc itself. And doing this properly is, well, for one, a lot of work, for which few of us have nearly enough time, and for two, we're kinda stuck on the design phase, see discussion in #813 for insight. Honestly, the reason pandoc-crossref exists in the first place is I got tired of looking at the discussion in #813 going nowhere.

@pyssling
Contributor

pyssling commented May 2, 2019

@lierdakil

> This would be wildly inconsistent across different output formats, sadly.

I'm not sure I follow. Do you mean that it would be wildly inconsistent to do the best we can, or the output would be wildly inconsistent, or something else?

> Yet there's no way to add/remove those fancy "Figure N" things with just those, apparently.

Unfortunately no. I agree that they should definitely be optional, but I'm not sure that default off is what you want for the sake of naive users, just because of the side-effects in Libreoffice of not including them. That's really just a judgement call though.

> It would be extremely inconvenient for pandoc-crossref users targeting ODT if it was enabled by default.

Why? Surely if you're already starting to do advanced filtering, then adding a few options isn't going to do much harm. But ok, it's more options.

> @scottmartincampbell above makes another case where having it enabled by default leads to at least surprising and at worst annoying and confusing results.

Regarding org-mode: @scottmartincampbell does org-mode add proper XML tags for the numbers in the "Figure X" or does it just add verbatim "Figure X" to the output. Maybe org-mode could be updated for newer versions of pandoc? I actually find this a pretty good argument for having this enabled by default as other tools are apparently working around deficiencies in old versions of the ODT writer.

> And I'm not sure a simple fix exists here, since "Figure N" is just good ol' text with little semantic markup in ODT.

This isn't actually true, the "Figure N" is actually not just good old text but contains a number of XML tags, specifically:
<text:sequence text:ref-name="refFigure0" text:name="Figure" text:formula="ooow:Figure+1" style:num-format="1">
and a few others for good measure. ODT XML is very verbose sometimes, but far nicer than OOXML.

Making the ODT reader understand this shouldn't be that hard and is anyway needed to meaningfully process existing ODT's containing captions created by Libreoffice as these "Figure N" (unless the author intentionally deletes them) will always be present when adding captions.
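As a rough illustration of what recognising such a caption could look like, here is a small Python sketch that pulls the counter out of a caption paragraph. It is not pandoc's ODT reader: the SAMPLE paragraph is an assumed shape built around the text:sequence element quoted above, the function name is made up for the example, and stripping the ": " separator is a plain heuristic.

```python
import xml.etree.ElementTree as ET

# ODF namespaces for the text: and style: prefixes
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
STYLE_NS = "urn:oasis:names:tc:opendocument:xmlns:style:1.0"

# Assumed example of a caption paragraph: label text, counter, separator, caption
SAMPLE = (
    f'<text:p xmlns:text="{TEXT_NS}" xmlns:style="{STYLE_NS}">'
    'Figure <text:sequence text:ref-name="refFigure0" text:name="Figure" '
    'text:formula="ooow:Figure+1" style:num-format="1">1</text:sequence>'
    ': A small diagram</text:p>'
)


def split_caption(paragraph_xml):
    """Return (counter_name, number, caption), or None if there is no counter."""
    para = ET.fromstring(paragraph_xml)
    seq = para.find(f"{{{TEXT_NS}}}sequence")  # direct children only
    if seq is None:
        return None

    name = seq.get(f"{{{TEXT_NS}}}name")
    number = (seq.text or "").strip()

    # Collect everything after the counter: its tail plus any later siblings
    children = list(para)
    parts = [seq.tail or ""]
    for sibling in children[children.index(seq) + 1:]:
        parts.append("".join(sibling.itertext()))
        parts.append(sibling.tail or "")

    # Heuristic: drop a leading ": " separator before the caption text
    caption = "".join(parts).lstrip(": ").strip()
    return name, number, caption
```

Note that the sketch keys on the sequence element and its text:name attribute and simply ignores the label text in front of it, so it does not need to know the localized word for "Figure". Inline formatting in the caption is flattened to plain text.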

> Granted. But that in itself would be extremely painful.

Sorry to hear that, but I still don't find that a compelling reason to disable features in the ODT writer by default. It seems to me that this would be holding back even more development in pandoc due to a tangentially related issue. I.e., what you're saying is that we can't integrate pandoc-crossref into pandoc because of unconcluded discussions and therefore we should disable by default features that interfere with a non-integrated pandoc-crossref. Is that about right?

@lierdakil
Contributor Author

@pyssling

> I'm not sure I follow.

Unless by "best we can" you mean "the most faithful reproduction of the semantic meaning of input in the output", the "look and feel" of output would be wildly inconsistent between different formats. And it doesn't seem to be what you meant, because, strictly speaking, figure/table numbering is not in the input, and neither are lists of objects.

> > It would be extremely inconvenient for pandoc-crossref users targeting ODT if it was enabled by default.
>
> Why?

Again, inconsistency. It's counter-intuitive that one would have to add specific command line options to pandoc invocation for one specific output format.

Also, bear in mind that the only real requirement for using pandoc-crossref is to invoke pandoc with -Fpandoc-crossref. With all due respect, I can't call that "advanced filtering". But it will quickly turn into "advanced filtering" if it starts acquiring qualifiers à la "but if outputting to ODT, you also need --no-figure-numbering --no-table-numbering" etc -- one would have to know pandoc's idiosyncrasies to use pandoc-crossref, which isn't really what I'd wish for here. I would be less adamant about this if filters could control writer options, but alas, that's generally not the case.

> "Figure N" [...] contains a number of XML tags

Well, apart from the number, which is an ODF counter, I mean. "Figure " part is just good ol' plain text.

> Making the ODT reader understand this shouldn't be that hard

Using brittle heuristics and ignoring i18n woes, sure, probably not that hard. But doing it well doesn't seem trivial, unless I'm missing something obvious (which I might be).

> What you're saying is that we can't integrate pandoc-crossref into pandoc because of unconcluded discussions and therefore we should disable by default features that interfere with a non-integrated pandoc-crossref. Is that about right?

Never said anything to the effect you're describing. Just explained why integrating pandoc-crossref into pandoc is not something that is likely to happen soon (or at all, it's an open question whether added maintenance cost is entirely justified in this particular case)

TL;DR of my position is as follows: having counters inserted in ODT but not in docx or HTML or whatever is not a great user experience and breaks compatibility with certain tools and certain workflows. Inserting counters in all output formats that remotely support something like this is, for one, unimplemented, and for two, a major behaviour change which should never be enabled by default, if one cares about compatibility at all, unless doing a major breaking release (and even then I'd think twice about that). Also, this is personal preference, but I would strongly prefer pandoc to not modify input (like add things) unless I explicitly asked it to. So, to me, hiding this feature behind an option makes a lot more sense, at least until/unless other writers catch up.

Don't get me wrong, I can kinda see your point. Lists-of-stuffs in LibreOffice are tied into counters (which LibreOffice creates for figures by default), so no counters means no list of figures/etc. But I sincerely doubt all that many users either need or expect to have, say, list-of-figures working in pandoc-generated ODT. And if they actually do need that, chances are they also need a similar thing in some other output format, which doesn't support that natively (e.g. HTML) -- at which point they'd want to use pandoc-crossref or equivalent, which wouldn't work properly with ODT by default, so additional hoop-jumping would be required. So, for a small subset of users who need list-of-figures/etc in ODT working out of the box, and who don't need that in any other output format (except perhaps LaTeX) -- for those users having this feature enabled by default makes sense. For everyone else, it would be at best inconsequential (if they don't care about ODT) and at worst annoyingly confusing.

Hopefully this makes my point clear.

P.S. We really need to wind down on this discussion somewhat. Anyone who reads this later probably won't be happy about these walls of text.

@scottmartincampbell

Here's my short view: a text-processing tool like pandoc should not modify content unnecessarily by default, or make assumptions about intent of the writer. Adding "Figure N" to the caption does that. An option that exists to better manage structural features like an index or table of contents is worthwhile, but not a good default if there are visible changes to the content.

To answer the above question: I believe org-mode just adds "Figure N" to the visible text, without any tagging. I don't like this either, but there are ways to alter the output (see https://orgmode.org/manual/Labels-and-captions-in-ODT-export.html#Labels-and-captions-in-ODT-export).

@jgm
Owner

jgm commented May 3, 2019

We already had inconsistency before this: LaTeX/PDF adds "Figure 1"; other formats don't. That's why I thought of this change as a reasonable one (and ditto for similar, not yet implemented changes to other output formats like docx). It hadn't been done up to now because of the localization issue, but that was solved by the Translations API.

That said, I think that the incompatibility with pandoc-crossref is a serious issue. pandoc-crossref is widely used, and since pandoc doesn't provide adequate cross-reference capacities, we should try not to break the pandoc-crossref workflow. Ultimately it would be desirable to build cross-referencing and counters into pandoc itself (see #813), but this is a big issue, and there are aspects to the design of pandoc-crossref that I've been hesitant to bring into pandoc (specifically the use of English word-fragments to mark different counters). I think we need to consider that a long-term improvement (and give it more emphasis), while considering here what to do in the immediate future.

@jgm
Owner

jgm commented May 3, 2019

Another note on this: in LaTeX, the numbering is really essential because of the way figures and tables "float." In formats where the figures/tables are guaranteed to appear in a particular place in the text, it's not quite as important. However, when you're targeting multiple output formats, this lack of uniformity is a problem. For LaTeX/PDF output, you can't just say "as the following figure shows..." because the figure might appear somewhere else in the text. You need "as Figure 1 shows..." But with unaltered pandoc you can't achieve that in output formats other than LaTeX/PDF (for which you could use raw tex). The only current way to solve this problem is by using pandoc-crossref.

@pyssling
Contributor

pyssling commented May 4, 2019

Well, I really don't know one way or the other. My preference is for ODT documents looking as if they were created by Libreoffice. I.e. that's my view of the best we can do. This also plays nicely with other Libreoffice functions like the index.

I don't really see the point of trying to produce "generic" documents where formatting normally present in these documents is omitted, i.e. HTML and ODT won't look the same, no matter what we do, and neither will LaTeX/PDF. So why not go for what the user would get had they written the document in Libreoffice?

That being said, I understand it's quite inconvenient for a lot of people, especially those that have written tools that assume the output is a stable interface.

@jgm
Owner

jgm commented Jun 8, 2019

Here's a practical suggestion.
pandoc-crossref will put tables it is numbering inside

,Div ("tbl:table1",[],[])

Figures will look like

Para [Image ("fig:figure2",[],[])

or

Div ("fig:subfigures",["subfigures"],[])
 [Para [Image ("",[],[]) [Str "a"] ("img1.jpg","fig:")]
 ,Para [Image ("fig:subfigureB",[],[]) [Str "b"] ("img1.jpg","fig:")]
  etc.

The opendocument/odt writer will get to the AST after pandoc-crossref is done with it.
So we could restore pandoc-crossref compatibility, without affecting the current behavior, by making the numbering depend on the absence of a pandoc-crossref-signature identifier on these elements.

[EDIT: I admit, it seems a bit hackish to make pandoc's behavior depend on a third-party filter in this way. And someone might use a fig: id without using pandoc-crossref. So maybe this isn't ideal.]

[EDIT: replaced 'pandoc-citeproc' with 'pandoc-crossref']
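The check suggested above can be sketched in Python against pandoc's JSON AST (the form filters see), rather than in the writer itself. This is only a sketch: the function names and the prefix tuple are illustrative assumptions, not part of pandoc or pandoc-crossref. Only the fig: and tbl: identifier convention is taken from the examples in this comment.

```python
# Identifier prefixes taken from the AST examples above
CROSSREF_PREFIXES = ("fig:", "tbl:")


def has_crossref_id(node):
    """True if an Image or Div node carries a pandoc-crossref-style identifier.

    In pandoc's JSON AST both node types store their attributes first:
    {"t": "Image", "c": [[id, classes, keyvals], inlines, [url, title]]}
    {"t": "Div",   "c": [[id, classes, keyvals], blocks]}
    """
    if node.get("t") not in ("Image", "Div"):
        return False
    identifier = node["c"][0][0]
    return identifier.startswith(CROSSREF_PREFIXES)


def writer_should_number(node):
    """Number natively only when pandoc-crossref has not claimed the element."""
    return not has_crossref_id(node)


# Example nodes mirroring the native AST fragments above
numbered_by_crossref = {
    "t": "Image",
    "c": [["fig:figure2", [], []], [{"t": "Str", "c": "b"}], ["img1.jpg", "fig:"]],
}
plain_image = {
    "t": "Image",
    "c": [["", [], []], [{"t": "Str", "c": "a"}], ["img1.jpg", "fig:"]],
}
table_wrapper = {"t": "Div", "c": [["tbl:table1", [], []], []]}
```

The caveat from the first EDIT applies to the sketch as well: a document that uses a fig: identifier without running pandoc-crossref would be treated as already numbered.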

@jgm
Owner

jgm commented Jun 8, 2019

Another idea along similar lines: we could set a variable pandoc-crossref if the pandoc-crossref filter has been specified on the command line, and the writer could check this.

jgm added a commit that referenced this issue Jun 8, 2019
This was added in pandoc 2.7.2, but it makes it impossible
to use pandoc-crossref. So this has been rolled back for now,
until we find a good solution to make this behavior optional
(or a creative way to let pandoc-crossref and this feature
to coexist).

See #5474.
@jgm
Owner

jgm commented Jun 8, 2019

For now I'm just going to roll this back; I think it's important to be able to use pandoc-crossref, since there's no other way currently to get cross-referencing. But leaving this issue open, so we can think about the best way to add this feature as an option.

I've left all the needed code in place (behind if False).

@pyssling
Contributor

@jgm I'd like to revisit this issue now as I need this functionality. For now we're sticking with an older version of pandoc which doesn't disable figure numbering.

What do you think is the best way forward? Should we add a commandline switch to enable figure/table/other enumeration? Or one to disable it? Or some specific detection of pandoc-crossref?

I'm afraid the current state is a bit sad as it makes it impossible to generate table of figures and table of tables in libreoffice, which are really needed for technical publishing.

@jgm added this to the 2.8 milestone Sep 21, 2019

@jgm
Owner

jgm commented Sep 21, 2019

I'll put it on the 2.8 milestone just so I think about it again, but no guarantees for this release; we might take it off again.

@lierdakil
Contributor Author

I think it's a good candidate for handling with extensions. Probably need to add an extension specifically for this. The end result will look like pandoc -t odt+native_numbering or something like that (the extension name is off the top of my head, you could likely think of a better name).

My vote is for "disabled by default" for the sake of backwards-compatibility (rule of the least surprise -- in general, things should behave differently only when the user does something differently, unless we're talking really major updates).

@jgm
Owner

jgm commented Sep 21, 2019

+native_numbering sounds reasonable to me

@pyssling
Contributor

Ok, that is, a command line switch extension of the "to" command line option. I see that some extensions are already present on other formats. I'll try to follow one of those examples. +native_numbering sounds good. That's pretty much what this is.

@pyssling
Contributor

Ok, I created a pull request: #5765 . Hope that's what you intended more or less.

@lierdakil
Contributor Author

Closing as #5765 is merged. Thanks @pyssling!
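For anyone landing here later, the outcome can be summarised with a small Python helper that builds the pandoc command line for the two workflows discussed in this thread. It assumes the extension ended up with the name agreed on above (odt+native_numbering). The helper itself, its name and its parameters are made up for illustration, and refusing to combine the two modes is a choice of this sketch based on the double-numbering problem described earlier, not something pandoc enforces.

```python
def pandoc_odt_args(source, output, native_numbering=False, use_crossref=False):
    """Build an argument list for converting `source` to ODT.

    native_numbering -> writer adds "Figure N"/"Table N" counters itself
    use_crossref     -> numbering is left to the pandoc-crossref filter
    """
    if native_numbering and use_crossref:
        # Both would label the same captions, which is the clash
        # this issue was opened about.
        raise ValueError("choose either native numbering or pandoc-crossref")

    writer = "odt+native_numbering" if native_numbering else "odt"
    args = ["pandoc", source, "-t", writer, "-o", output]
    if use_crossref:
        args += ["-F", "pandoc-crossref"]
    return args
```

The returned list can be passed to subprocess.run as is. With neither flag set, the command is the plain default with numbering off, as argued for above.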
