HTML Reader: Regression for figure without figcaption #4183

davidar · 2017-12-21T03:55:26Z

HTML figures without a corresponding figcaption should fall back to the alt text of the img (as was the behaviour prior to #3813) rather than having no caption at all.

$ pandoc -v
pandoc 2.0.5
$ echo '<figure><img src="foo" alt="bar"></figure>' | pandoc -f html -t markdown
![](foo)
$ echo '<img src="foo" alt="bar">' | pandoc -f html -t markdown
![bar](foo)

A related issue is that figcaption is ignored when it contains block tags like div or p:

$ echo '<figure><img src="foo" alt="bar"><figcaption><p>baz</p></figcaption></figure>' \
  | pandoc -f html -t markdown
![](foo)
$ echo '<figure><img src="foo" alt="bar"><figcaption><div>baz</div></figcaption></figure>' \
  | pandoc -f html -t markdown
![](foo)
$ echo '<figure><img src="foo" alt="bar"><figcaption><span>baz</span></figcaption></figure>' \
  | pandoc -f html -t markdown
![baz](foo)
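
As a rough illustration of the behaviour being asked for here, caption text could be gathered from everything inside figcaption, whatever tags wrap it. This is a standalone Python sketch using only the standard library, not pandoc's Haskell HTML reader; `FigcaptionText` and `figcaption_text` are hypothetical names chosen for the example.

```python
from html.parser import HTMLParser


class FigcaptionText(HTMLParser):
    """Collect all text found inside <figcaption>, ignoring wrapper tags."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # greater than 0 while inside a <figcaption>
        self.parts = []  # text fragments seen inside the caption

    def handle_starttag(self, tag, attrs):
        if tag == "figcaption":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "figcaption" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Nested <p>, <div> or <span> do not matter: only the text is kept.
        if self.depth > 0:
            self.parts.append(data)


def figcaption_text(html):
    """Return the whitespace-normalised caption text, or '' if there is none."""
    parser = FigcaptionText()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())


if __name__ == "__main__":
    for wrapper in ("p", "div", "span"):
        html = (
            '<figure><img src="foo" alt="bar">'
            f"<figcaption><{wrapper}>baz</{wrapper}></figcaption></figure>"
        )
        print(wrapper, "->", figcaption_text(html))
```

With this approach the p, div and span variants above all yield the same caption text; a real reader would keep inline formatting rather than flattening to a string.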

jgm · 2017-12-21T20:02:18Z

> HTML figures without a corresponding figcaption should fall back to the alt text of the img (as was the behaviour prior to #3813) rather than having no caption at all.

I'm not sure why. In the HTML page as displayed by a browser, there will be no caption. So, why should pandoc parse this as a figure with a caption?

davidar · 2017-12-22T15:03:34Z

For the same reason this happens:

$ echo '<p>baz</p><p><img src="foo" alt="bar"></p><p>baz</p>' | pandoc -f html -t markdown
baz

![bar](foo)

baz

$ echo '<p>baz</p><p><img src="foo" alt="bar"></p><p>baz</p>' | pandoc -f html -t markdown \
  | pandoc
<p>baz</p>
<figure>
<img src="foo" alt="bar" /><figcaption>bar</figcaption>
</figure>
<p>baz</p>

jgm · 2017-12-22T19:29:41Z

A better fix for the round-trip failure you point to would be making the markdown output from the first step something like

![bar](foo)<!-- nofig -->

to avoid having it interpreted as an implicit figure.

davidar · 2017-12-23T06:13:45Z

Basically all I'm saying is that wrapping an img in figure shouldn't result in the alt text being completely ignored. I'm not sure what the best way to handle that would be.

mb21 · 2017-12-25T20:21:50Z

> I'm not sure why. In the HTML page as displayed by a browser, there will be no caption.

I changed the pull request accordingly.

I guess parsing the alt text when there is no figcaption will have to wait for #3177.

gwern · 2021-04-23T01:46:58Z

I disagree about the need for waiting for #3177 (which has been 4 years and shows no particular sign of being solved soon). Pandoc already supports the necessary construct in the form of attributes, and already correctly parses and deals with img tags with alt attributes... just only in the Markdown->HTML direction:

$ echo '![](foo.jpg){alt="bar"}'                              | pandoc -f markdown -t html
<p><img src="foo.jpg" alt="bar" /></p>
$ echo '<p><img src="foo.jpg" alt="bar" /></p>' | pandoc -f html -t markdown
![bar](foo.jpg)

And of course, Pandoc even handles figures to some degree by treating them as HTML literals... but only if you parse the HTML as Markdown; parsing the HTML as HTML delivers worse results, erasing attributes and classes:

$ echo '<figure><img src="foo.jpg" alt="bar" /></figure>' | pandoc -f html -t markdown
![](foo.jpg)
$ echo '<figure><img src="foo.jpg" alt="bar" /></figure>' | pandoc -f markdown -t markdown
 \```{=html}
<figure>
 \```
 \`<img src="foo.jpg" alt="bar" />`{=html}
 \```{=html}
</figure>
 \```

The HTML reader just needs to be improved to handle alt/title/class attributes on imgs,

tarleb · 2021-04-23T07:48:09Z

> waiting for #3177 (which has been 4 years and shows no particular sign of being solved soon).

We aim to tackle this in a GSoC project this summer, see #7184.

jgm · 2021-04-29T02:28:30Z

We could easily change the reader so that the alt was used as the caption when no caption exists. But I'm not sure this is the right way forward; it would mean that a figure without a caption became a figure with a caption after round-tripping html -> markdown -> html.
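
To make the round-trip concern concrete, here is a minimal Python sketch of a toy figure reader and writer. It is not pandoc's code and it skips the intermediate markdown step: `FigureParser`, `read_figure`, `write_figure` and the `alt_fallback` flag are all hypothetical names used only for illustration.

```python
from html import escape
from html.parser import HTMLParser


class FigureParser(HTMLParser):
    """Pull src, alt and figcaption text out of a single <figure>."""

    def __init__(self):
        super().__init__()
        self.src = None
        self.alt = None
        self.caption = None  # None means no <figcaption> was present
        self._in_caption = False

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            attrs = dict(attrs)
            self.src = attrs.get("src")
            self.alt = attrs.get("alt")
        elif tag == "figcaption":
            self._in_caption = True
            self.caption = ""

    def handle_endtag(self, tag):
        if tag == "figcaption":
            self._in_caption = False

    def handle_data(self, data):
        if self._in_caption:
            self.caption += data


def read_figure(html, alt_fallback):
    """Toy reader: optionally use the alt text when there is no figcaption."""
    parser = FigureParser()
    parser.feed(html)
    parser.close()
    caption = parser.caption
    if caption is None and alt_fallback:
        caption = parser.alt
    return {"src": parser.src, "alt": parser.alt, "caption": caption}


def write_figure(fig):
    """Toy writer: emit a <figcaption> only when the figure has a caption."""
    img = '<img src="%s" alt="%s">' % (
        escape(fig["src"] or ""),
        escape(fig["alt"] or ""),
    )
    if fig["caption"]:
        return "<figure>%s<figcaption>%s</figcaption></figure>" % (
            img,
            escape(fig["caption"]),
        )
    return "<figure>%s</figure>" % img


if __name__ == "__main__":
    source = '<figure><img src="foo" alt="bar"></figure>'
    print(write_figure(read_figure(source, alt_fallback=False)))
    print(write_figure(read_figure(source, alt_fallback=True)))
```

Without the fallback the toy round trip reproduces the input unchanged; with it, the output gains a figcaption that the input never had.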

mb21 added a commit to mb21/pandoc that referenced this issue Dec 21, 2017

HTML Reader: be more forgiving about figcaption

671f912

fixes jgm#4183

mb21 added a commit to mb21/pandoc that referenced this issue Dec 21, 2017

HTML Reader: be more forgiving about figcaption

88d4331

fixes jgm#4183

mb21 mentioned this issue Dec 21, 2017

HTML Reader: be more forgiving about figcaption #4184

Merged

mb21 added a commit to mb21/pandoc that referenced this issue Dec 21, 2017

HTML Reader: be more forgiving about figcaption

893fef2

fixes jgm#4183

mb21 added a commit to mb21/pandoc that referenced this issue Dec 23, 2017

HTML Reader: be more forgiving about figcaption

9b54b94

fixes jgm#4183

link2xt added format:HTML reader labels Dec 24, 2017

jgm closed this as completed in #4184 Dec 27, 2017
