HTML Reader: Regression for figure without figcaption #4183

Closed
davidar opened this issue Dec 21, 2017 · 8 comments · Fixed by #4184

@davidar
Contributor

davidar commented Dec 21, 2017

HTML figures without a corresponding figcaption should fall back to the alt text of the img (as was the behaviour prior to #3813) rather than having no caption at all.

$ pandoc -v
pandoc 2.0.5
$ echo '<figure><img src="foo" alt="bar"></figure>' | pandoc -f html -t markdown
![](foo)
$ echo '<img src="foo" alt="bar">' | pandoc -f html -t markdown
![bar](foo)

A related issue is that figcaption is ignored when it contains block tags like div or p:

$ echo '<figure><img src="foo" alt="bar"><figcaption><p>baz</p></figcaption></figure>' \
  | pandoc -f html -t markdown
![](foo)
$ echo '<figure><img src="foo" alt="bar"><figcaption><div>baz</div></figcaption></figure>' \
  | pandoc -f html -t markdown
![](foo)
$ echo '<figure><img src="foo" alt="bar"><figcaption><span>baz</span></figcaption></figure>' \
  | pandoc -f html -t markdown
![baz](foo)
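
The two behaviours requested above (falling back to the img alt text when there is no figcaption, and accepting block content such as p or div inside figcaption) can be sketched outside pandoc. What follows is a minimal Python illustration built on the standard library's `html.parser`, not pandoc's Haskell HTML reader; `FigureCaptionParser`, `figure_to_markdown`, and the `alt_fallback` flag are hypothetical names introduced only for this sketch.

```python
from html.parser import HTMLParser


class FigureCaptionParser(HTMLParser):
    """Collect src, alt and figcaption text of the first image (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.src = None
        self.alt = None
        self.caption_parts = []
        self._in_figcaption = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and self.src is None:
            self.src = attrs.get("src", "")
            self.alt = attrs.get("alt", "")
        elif tag == "figcaption":
            self._in_figcaption = True

    def handle_endtag(self, tag):
        if tag == "figcaption":
            self._in_figcaption = False

    def handle_data(self, data):
        # Text inside <figcaption> counts whether it is wrapped in <p>, <div> or <span>.
        if self._in_figcaption:
            self.caption_parts.append(data)


def figure_to_markdown(html, alt_fallback=True):
    """Render a figure as a Markdown image, assuming one <img> per figure."""
    parser = FigureCaptionParser()
    parser.feed(html)
    parser.close()
    # Collapse whitespace so block wrappers do not leave stray newlines.
    caption = " ".join("".join(parser.caption_parts).split())
    if not caption and alt_fallback:
        # Requested behaviour: no figcaption means the alt text is used instead.
        caption = parser.alt or ""
    return f"![{caption}]({parser.src or ''})"


print(figure_to_markdown('<figure><img src="foo" alt="bar"></figure>'))
print(figure_to_markdown(
    '<figure><img src="foo" alt="bar"><figcaption><p>baz</p></figcaption></figure>'
))
```

With `alt_fallback=False` the first call yields `![](foo)`, which matches the pandoc 2.0.5 output shown in the report.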
mb21 added a commit to mb21/pandoc that referenced this issue Dec 21, 2017
@jgm
Owner

jgm commented Dec 21, 2017

> HTML figures without a corresponding figcaption should fall back to the alt text of the img (as was the behaviour prior to #3813) rather than having no caption at all.

I'm not sure why. In the HTML page as displayed by a browser, there will be no caption. So, why should pandoc parse this as a figure with a caption?

@davidar
Contributor Author

davidar commented Dec 22, 2017

For the same reason this happens:

$ echo '<p>baz</p><p><img src="foo" alt="bar"></p><p>baz</p>' | pandoc -f html -t markdown
baz

![bar](foo)

baz

$ echo '<p>baz</p><p><img src="foo" alt="bar"></p><p>baz</p>' | pandoc -f html -t markdown \
  | pandoc
<p>baz</p>
<figure>
<img src="foo" alt="bar" /><figcaption>bar</figcaption>
</figure>
<p>baz</p>

@jgm
Owner

jgm commented Dec 22, 2017

A better fix for the round-trip failure you point to would be making the markdown output from the first step something like

![bar](foo)<!-- nofig -->

to avoid having it interpreted as an implicit figure.
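
The suggestion above amounts to a writer-side rule: when a paragraph holds nothing but an image that was not a figure in the source, append something so the image is no longer alone in its paragraph. A minimal Python sketch of that rule follows. It is not pandoc's Markdown writer; `write_image_paragraph` and its `is_figure` parameter are hypothetical, and the marker text is simply the one proposed in the comment.

```python
NOFIG_MARKER = "<!-- nofig -->"  # marker text taken from the suggestion above


def write_image_paragraph(alt, src, is_figure):
    """Render a paragraph whose only content is an image (hypothetical helper)."""
    image = f"![{alt}]({src})"
    # An image alone in a paragraph is read back as an implicit figure,
    # so a plain (non-figure) image gets a trailing marker to prevent that.
    return image if is_figure else image + NOFIG_MARKER


print(write_image_paragraph("bar", "foo", is_figure=False))
print(write_image_paragraph("bar", "foo", is_figure=True))
```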

@davidar
Contributor Author

davidar commented Dec 23, 2017

Basically all I'm saying is that wrapping an img in figure shouldn't result in the alt text being completely ignored. I'm not sure what the best way to handle that would be.

mb21 added a commit to mb21/pandoc that referenced this issue Dec 23, 2017
@mb21
Collaborator

mb21 commented Dec 25, 2017

> I'm not sure why. In the HTML page as displayed by a browser, there will be no caption.

I changed the pull request accordingly.

I guess parsing the alt text when there is no figcaption will have to wait for #3177.

@jgm jgm closed this as completed in #4184 Dec 27, 2017
@gwern
Contributor

gwern commented Apr 23, 2021

I disagree about the need for waiting for #3177 (which has been 4 years and shows no particular sign of being solved soon). Pandoc already supports the necessary construct in the form of attributes, and already correctly parses and deals with img tags with alt attributes... just only in the Markdown->HTML direction:

$ echo '![](foo.jpg){alt="bar"}' | pandoc -f markdown -t html
<p><img src="foo.jpg" alt="bar" /></p>
$ echo '<p><img src="foo.jpg" alt="bar" /></p>' | pandoc -f html -t markdown
![bar](foo.jpg)

And of course, Pandoc even handles figures to some degree by treating them as raw HTML... but only if you parse the HTML as Markdown; parsing the HTML as HTML delivers worse results, erasing attributes and classes:

$ echo '<figure><img src="foo.jpg" alt="bar" /></figure>' | pandoc -f html -t markdown
![](foo.jpg)
$ echo '<figure><img src="foo.jpg" alt="bar" /></figure>' | pandoc -f markdown -t markdown
```{=html}
<figure>
```
`<img src="foo.jpg" alt="bar" />`{=html}
```{=html}
</figure>
```

The HTML reader just needs to be improved to handle alt/title/class attributes on imgs.

@tarleb
Collaborator

tarleb commented Apr 23, 2021

> waiting for #3177 (which has been 4 years and shows no particular sign of being solved soon).

We aim to tackle this in a GSoC project this summer, see #7184.

@jgm
Owner

jgm commented Apr 29, 2021

We could easily change the reader so that the alt was used as the caption when no caption exists. But I'm not sure this is the right way forward; it would mean that a figure without a caption became a figure with a caption after round-tripping html -> markdown -> html.
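
The round-trip concern can be made concrete with a toy model. The Python sketch below is not pandoc code: `Figure`, `read_with_alt_fallback`, and `to_html` are hypothetical names, and the model assumes a figure is just a source, an alt text, and an optional caption. It shows that once a reader promotes the alt text to a caption, writing the result back to HTML produces a figcaption the input never had.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Figure:
    src: str
    alt: str
    caption: Optional[str]  # None means the source had no <figcaption>


def read_with_alt_fallback(fig: Figure) -> Figure:
    """Model a reader that uses the alt text as caption when none exists."""
    if fig.caption is None:
        return Figure(fig.src, fig.alt, fig.alt)
    return fig


def to_html(fig: Figure) -> str:
    """Write the figure back out, emitting <figcaption> only when a caption is set."""
    caption = f"<figcaption>{fig.caption}</figcaption>" if fig.caption is not None else ""
    return f'<figure><img src="{fig.src}" alt="{fig.alt}" />{caption}</figure>'


original = Figure(src="foo", alt="bar", caption=None)
round_tripped = read_with_alt_fallback(original)

print(to_html(original))
print(to_html(round_tripped))
# The two outputs differ: the fallback has turned a captionless figure into a captioned one.
print(to_html(original) == to_html(round_tripped))
```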
