Docx reader: smartTags omitted #3412

Closed
uoou opened this issue Feb 3, 2017 · 3 comments

uoou commented Feb 3, 2017

Deep apologies if this is known - I did search but it's tricky to search for.

I was converting a .docx to markdown (tried other formats after finding the bug and they all seem affected) and on conversion, certain words, all proper nouns, were being omitted.

I can't be sure, since I don't have access to MS Word, but I suspect Word is auto-capitalising these proper nouns and marking them in some way, perhaps with a hidden character, which pandoc then reads as garbage.

If I use Google Docs as a filter, i.e. upload the .docx to Google Docs and then download it again (still as .docx), the problem is resolved.

I will attach the afflicted .docx file. It's under a CC-ND-SA license.

An example passage where the bug occurs is:

"but it’s worth remembering that King James, to whom Bacon dedicated the book—and who was at the time one of the finest scholars in Europe—was completely baffled"

On conversion, the word "Europe" is omitted.

sh.docx

Thanks!

Collaborator

mb21 commented Feb 3, 2017

The problem is probably the smartTag, from the posted docx:

<w:r>
  <w:t xml:space="preserve">—the power to predict natural phenomena and so to cope with some of the dangers, and take advantage of some of the opportunities, that they present. To us, now, this seems blindingly obvious; but it’s worth remembering that King James, to whom Bacon dedicated the book—and who was at the time one of the finest scholars in </w:t>
</w:r>
<w:smartTag w:uri="urn:schemas-microsoft-com:office:smarttags" w:element="place">
  <w:r>
    <w:t>Europe</w:t>
  </w:r>
</w:smartTag>
<w:r>

mb21 changed the title from "Auto-capitalised words in .docx omitted on conversion" to "Docx reader: smartTags omitted" on Feb 3, 2017
Collaborator

mb21 commented Feb 3, 2017

ah, this is a duplicate of #2242 then...

mb21 closed this as completed on Feb 3, 2017
jgm added a commit that referenced this issue Feb 3, 2017
This just parses inside smartTags and yields their contents,
ignoring the attributes of the smartTag.  @jkr, you may want
to adjust this, but I wanted to get a fix in as fast as possible
for the dropped content.

Closes #2242; see also #3412.
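
The behaviour described in this commit message (descend into a smartTag, keep its contents, ignore its attributes) can be illustrated with a minimal Python sketch. This is not pandoc's Haskell implementation; the function name `paragraph_text` and the simplified sample paragraph are assumptions made for illustration, and the namespace is the one used by transitional WordprocessingML documents.

```python
import xml.etree.ElementTree as ET

# Main WordprocessingML namespace (transitional documents).
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = "{" + W_NS + "}"


def paragraph_text(node):
    """Collect run text under `node`, descending into smartTag wrappers.

    Hypothetical helper for illustration only.
    """
    parts = []
    for child in node:
        if child.tag == W + "r":
            # An ordinary run: take the text of its w:t children.
            parts.extend(t.text or "" for t in child.findall(W + "t"))
        elif child.tag == W + "smartTag":
            # Recurse into the wrapper and keep its contents;
            # the w:uri and w:element attributes are ignored.
            parts.append(paragraph_text(child))
        # Any other element is skipped in this sketch.
    return "".join(parts)


# Simplified version of the fragment quoted earlier in the thread.
SAMPLE = f"""
<w:p xmlns:w="{W_NS}">
  <w:r><w:t xml:space="preserve">one of the finest scholars in </w:t></w:r>
  <w:smartTag w:uri="urn:schemas-microsoft-com:office:smarttags" w:element="place">
    <w:r><w:t>Europe</w:t></w:r>
  </w:smartTag>
  <w:r><w:t xml:space="preserve"> was completely baffled</w:t></w:r>
</w:p>
"""

print(paragraph_text(ET.fromstring(SAMPLE)))
```

A reader that handles only the `w:r` branch would skip the wrapped run entirely, which matches the symptom reported above: the word "Europe" goes missing.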

abartov commented Sep 29, 2020

Thank you very much for fixing this, @jgm! Random letters in my docx (beyond my control) are in "SmartTag" tags (for no discernible reason), and their disappearance was driving me crazy.
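
For anyone wanting to check whether a given .docx contains such wrappers, a rough Python sketch follows. It relies on the fact that a .docx is a zip archive whose main body is stored in `word/document.xml`; the function name `smart_tag_texts` is hypothetical, and the namespace is assumed to be the transitional WordprocessingML one.

```python
import zipfile
import xml.etree.ElementTree as ET

# Main WordprocessingML namespace (transitional documents).
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def smart_tag_texts(docx):
    """Return the text wrapped by each w:smartTag in word/document.xml.

    `docx` may be a file path or a binary file object.
    Nested smartTags are reported once per wrapper.
    Hypothetical helper for illustration only.
    """
    with zipfile.ZipFile(docx) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    return [
        "".join(t.text or "" for t in tag.iter(W + "t"))
        for tag in root.iter(W + "smartTag")
    ]
```

Calling `smart_tag_texts("sh.docx")` on the file attached to this issue should list the wrapped words, assuming the file is saved locally under that name.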
