Remove extraneous, significant whitespace in JATS writer output #4335
…> tags Text wrapping, and indentation in paragraphs, creates extra whitespace. The JATS spec has a content model for `<p>` tags of `(#PCDATA | ...`. Any tag where `#PCDATA` children are possible should not have any indentation. This is consistent with the Pandoc HTML writer.
These tags contain `#PCDATA` so shouldn't introduce extra whitespace.
src/Text/Pandoc/Writers/JATS.hs
  let render' :: Doc -> Text
-     render' = render colwidth
+     render' = render Nothing
Thanks for this PR! One comment on this line. This disables wrapping entirely. I don't think we need to do anything that drastic. Text.Pandoc.Pretty provides a `nowrap` combinator which can locally disable wrapping when that is absolutely needed. But normal wrapping, which occurs on Space elements, should be innocuous, so you should only need the `nowrap` in a few special cases (e.g. verbatim).
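To make the difference between the two approaches concrete, here is a minimal sketch. It assumes the `doclayout` package (version 0.3 or later), which is the standalone successor of `Text.Pandoc.Pretty`; it is not the code from this PR, and the sample sentence and column width are made up for illustration.

```haskell
{-# LANGUAGE OverloadedStrings #-}
-- Sketch only: assumes the doclayout package (>= 0.3) is available.
import Data.Text (Text)
import qualified Data.Text as T
import qualified Data.Text.IO as TIO
import Text.DocLayout (Doc, hsep, literal, nowrap, render)

-- A document made of words joined by breakable spaces,
-- similar in spirit to how Space elements are laid out.
sentence :: Doc Text
sentence = hsep (map literal (T.words
  "Because a hard-wrapped line in the middle of a paragraph looked like a list item."))

main :: IO ()
main = do
  -- Wrapping at 30 columns: breakable spaces may become line breaks.
  TIO.putStrLn (render (Just 30) sentence)
  -- Same width, but wrapping is disabled locally for this one Doc.
  TIO.putStrLn (render (Just 30) (nowrap sentence))
  -- Passing Nothing disables wrapping for the entire document.
  TIO.putStrLn (render Nothing sentence)
```

The second call corresponds to the local `nowrap` suggested above, and the third to the global `render Nothing` change under review.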
Thanks for the review @jgm. I've reinstated the wrapping now.
@jgm: I reinstated wrapping with 10eb42d, which broke the tests but illustrates the issue: extra whitespace is introduced when both indentation and wrapping are enabled as currently implemented. Here's an example from the breaking test.

Without wrapping:

<sec id="paragraphs">
  <title>Paragraphs</title>
  <p>Here’s a regular paragraph.</p>
  <p>In Markdown 1.0.0 and earlier. Version 8. This line turns into a list item. Because a hard-wrapped line in the middle of a paragraph looked like a list item.</p>
  <p>Here’s one with a bullet. * criminey.</p>
  <p>There should be a hard line break<break />here.</p>
</sec>

With wrapping:

<sec id="paragraphs">
  <title>Paragraphs</title>
  <p>Here’s a regular paragraph.</p>
  <p>In Markdown 1.0.0 and earlier. Version 8. This line turns into a list
  item. Because a hard-wrapped line in the middle of a paragraph looked like a
  list item.</p>
  <p>Here’s one with a bullet. * criminey.</p>
  <p>There should be a hard line break<break />here.</p>
</sec>

Note that because the <p> is a child of <sec>, an extra 2 significant spaces are inserted at the start of each wrapped line. I'm afraid I don't have a good enough knowledge of Haskell or Pandoc to be able to fix this myself.
So, is it the case that in JATS (unlike DocBook or HTML),
there is a semantic difference between two spaces and one
space inside a `<p>` element?
Where is this documented?
+++ Nokome Bentley [Feb 26 18 00:31 ]:
… Note that because the <p> is a child of <sec> that an extra 2
significant spaces are inserted at the start of each wrapped line.
You're right, I wrongly assumed there was a semantic difference but that might not be the case. I did some searching but found it difficult to find clear documentation on this matter in the JATS standard https://jats.nlm.nih.gov/archiving/tag-library/1.1d1/. The best I found was the "Whitespace Handling" section in https://www.ncbi.nlm.nih.gov/books/NBK425547/ which says:
This implies that processors usually collapse runs of more than one space into a single space. Pinging @michael and @oliver---- who may be able to shed some light on the matter. Either way, as you say, a user can turn off wrapping locally, so I'm happy to defer to your judgement and adjust the test case so that it runs with wrapping on.
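As a rough illustration of why the extra spaces may be harmless, here is a small sketch using only the Haskell base library. It assumes a processor that collapses every run of whitespace into one space; `normalizeSpace` is a hypothetical helper written for this example, not part of Pandoc or of any JATS tool.

```haskell
-- Sketch only: models a processor that collapses whitespace runs.

-- Collapse every run of whitespace (spaces, newlines) into a single
-- space and drop leading and trailing whitespace.
normalizeSpace :: String -> String
normalizeSpace = unwords . words

-- Paragraph text as emitted with wrapping and a 2-space indent.
wrapped :: String
wrapped = "This line turns into a list\n  item. Because a hard-wrapped line looked like a\n  list item."

-- The same text as emitted without wrapping.
unwrapped :: String
unwrapped = "This line turns into a list item. Because a hard-wrapped line looked like a list item."

main :: IO ()
main = do
  print (wrapped == unwrapped)                                -- False
  print (normalizeSpace wrapped == normalizeSpace unwrapped)  -- True
```

Under that assumption the wrapped and unwrapped outputs carry the same text; a processor that preserves whitespace verbatim would still see the difference.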
I'd prefer it if the test case had wrapping, because that would be consistent with the other writer tests.
@jgm: I have reintroduced wrapping into the test case and all tests are now passing.
Great, thanks for the patch!
Version 2.1.3 of Pandoc now deals with extra whitespace in JATS jgm/pandoc#4335
The JATS writer produces more readable XML by using indentation. However, it was inappropriately indenting the content of elements which may contain `#PCDATA`, thereby adding extraneous, significant whitespace. In addition, it had wrapping turned on, which added more whitespace. I checked all the uses of `inTagsIndented` in `src/Text/Pandoc/Writers/JATS.hs` against the JATS spec and fixed those tags which can have `#PCDATA` content:
<caption>
<def-item>
<def>
<disp-quote>
<fn>
<label>
<list-item>
<p>
<ref-list>
<tbody>
<td>
<term>
<th>
<thead>
<tr>
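The effect of the change can be sketched with plain strings. The two helpers below, `tagIndented` and `tagInline`, are hypothetical stand-ins written for this illustration using only the base library; they are not Pandoc's `inTagsIndented` or its real signature.

```haskell
-- Sketch only: toy tag renderers, not Pandoc's XML helpers.

-- Put the content on its own indented lines between the tags.
-- This is readable, but the newlines and indentation become part
-- of the element's text content.
tagIndented :: String -> [String] -> [String]
tagIndented name body =
  ["<" ++ name ++ ">"] ++ map ("  " ++) body ++ ["</" ++ name ++ ">"]

-- Keep the content flush against the tags, adding no whitespace.
tagInline :: String -> String -> String
tagInline name body = "<" ++ name ++ ">" ++ body ++ "</" ++ name ++ ">"

main :: IO ()
main = do
  -- Old behaviour for elements such as <p>: content is indented.
  putStr (unlines (tagIndented "p" ["Here's a regular paragraph."]))
  -- New behaviour: no whitespace is added inside the element.
  putStrLn (tagInline "p" "Here's a regular paragraph.")
```

Elements that can only contain other elements, such as `<sec>`, can still be indented safely, because whitespace between child elements is not character data there.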