pandoc: Cannot decode byte '\xf6': Data.Text.Internal.Encoding.streamDecodeUtf8With: Invalid UTF-8 stream #5600

Closed
lzlcfd opened this issue Jun 18, 2019 · 13 comments

Comments


lzlcfd commented Jun 18, 2019

I want to convert the tex file into a docx document using the following command:

pandoc -f latex -t docx -o master.docx --bibliography ./proposal.bib master.tex

but pandoc gave the following error:

pandoc: Cannot decode byte '\xf6': Data.Text.Internal.Encoding.streamDecodeUtf8With: Invalid UTF-8 stream

I am assured that the tex file is saved as the utf-8 format, but still the problem occurred.

Contributor

wilx commented Jun 19, 2019

Can you reduce your file to a smallest possible test case that exhibits the issue?

Collaborator

mb21 commented Jun 19, 2019

I am assured that the tex file is saved as the utf-8 format

Just in case, try: https://pandoc.org/MANUAL.html#character-encoding

Author

lzlcfd commented Jun 19, 2019

I am assured that the tex file is saved as the utf-8 format

Just in case, try: https://pandoc.org/MANUAL.html#character-encoding

Thank you for your reply. I tried the command line as you suggested, but it still failed.

Contributor

agusmba commented Jun 19, 2019

Can you reduce your file to a smallest possible test case that exhibits the issue?

Owner

jgm commented Jun 21, 2019

You must be using an older version of pandoc, because recent versions will give the byte position of the decoding error; that may help in tracking this down. Your input must not be valid UTF-8.

Owner

jgm commented Jun 21, 2019

Check the bib file too!
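
For anyone on an older pandoc that does not report the byte position, the offending byte can be located by hand. A minimal Python sketch, assuming the inputs can be read fully into memory; the function name `first_invalid_utf8` is made up for this illustration and is not part of pandoc:

```python
from typing import Optional, Tuple


def first_invalid_utf8(data: bytes) -> Optional[Tuple[int, int]]:
    """Return (byte_offset, line_number) of the first invalid UTF-8 byte, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start is the offset of the first byte that could not be decoded.
        line = data.count(b"\n", 0, err.start) + 1
        return err.start, line
    return None


def report(path: str) -> str:
    """Describe where (if anywhere) the file at `path` stops being valid UTF-8."""
    with open(path, "rb") as handle:
        hit = first_invalid_utf8(handle.read())
    if hit is None:
        return f"{path}: valid UTF-8"
    offset, line = hit
    return f"{path}: invalid byte at offset {offset} (line {line})"
```

Calling `report` on every file pandoc reads (here that would be `master.tex` and `proposal.bib`) should narrow the problem down to one line of one file.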

Collaborator

mb21 commented Jun 22, 2019

Closing until we have a reproducible example...


tranner commented Nov 13, 2019

I ran into similar problems and thought I would give some more details for others searching for the same issue.

A (quite) minimal example of the error:

\usepackage{amsaddr}

with amsaddr downloaded from https://www.ctan.org/tex-archive/macros/latex/contrib/amsaddr.

 $ pandoc --version
pandoc 2.7.3
Compiled with pandoc-types 1.17.5.4, texmath 0.11.2.2, skylighting 0.8.1
Default user data directory: /localhome/scstr/.local/share/pandoc or /localhome/scstr/.pandoc
Copyright (C) 2006-2019 John MacFarlane
Web:  http://pandoc.org
This is free software; see the source for copying conditions.
There is no warranty, not even for merchantability or fitness
for a particular purpose.
$ echo "\usepackage{amsaddr}" | pandoc -t native
pandoc: Cannot decode byte '\xe9': Data.Text.Internal.Encoding.streamDecodeUtf8With: Invalid UTF-8 stream

The iconv fix suggested at https://pandoc.org/MANUAL.html#character-encoding makes no difference here, but a possible fix is to open amsaddr.sty and remove the bad characters from line 9:

- %% Copyright (C) 2006 by J�r�me Lelong <jerome.lelong@gmail.com>
+ %% Copyright (C) 2006 by Lelong <jerome.lelong@gmail.com>

Then we have:

$ echo "\usepackage{amsaddr}" | pandoc -t native
[RawBlock (Format "tex") "\\usepackage{amsaddr}"]

The bad characters are in a comment so perhaps you would hope that this does not cause problems. Either way, this is a possible fix for users.
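
An alternative to deleting the characters is to re-encode the file as UTF-8. A minimal Python sketch, assuming the stray bytes are ISO-8859-1 (the '\xe9' in the error message is "é" in that encoding); the function name and the sample bytes are hypothetical and chosen only for illustration:

```python
def latin1_to_utf8(data: bytes) -> bytes:
    """Re-encode bytes that are assumed to be ISO-8859-1 as UTF-8."""
    # Every byte value is a valid Latin-1 character, so this decode never fails.
    # If the assumption about the source encoding is wrong, the result is mojibake.
    return data.decode("latin-1").encode("utf-8")


# Hypothetical sample containing the byte from the error message.
sample = b"r\xe9sum\xe9"
converted = latin1_to_utf8(sample)
print(converted.decode("utf-8"))
```

This should only be run on a file that is known not to be UTF-8 already; applying it to valid UTF-8 would double-encode every non-ASCII character.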

jgm reopened this Nov 13, 2019
jgm closed this as completed Nov 13, 2019

defwxyz commented Feb 20, 2021

This may be related to https://serokell.io/blog/haskell-with-utf8

How to reproduce:

pandoc https://www.google.com
pandoc: Cannot decode byte '\xe9': Data.Text.Internal.Encoding.decodeUtf8: Invalid UTF-8 stream

I use the latest version on Debian Linux buster:

pandoc --version
pandoc 2.11.4
Compiled with pandoc-types 1.22, texmath 0.12.1, skylighting 0.10.2,
citeproc 0.3.0.5, ipynb 0.1.0.1

As a workaround:

  1. Fetch the HTML source of the page (from http://www.google.com) with a small program (using simpleHTTP (getRequest url) from the Network.HTTP Haskell module).
  2. Use that file as input to pandoc.

Done this way, pandoc reads the input without problems and converts it to any format.

Owner

jgm commented Feb 20, 2021

Well, I looked into this. Despite using a meta tag that says it's UTF-8, the Google page is not UTF-8 encoded, nor is it served with a UTF-8 mime type (curl reports content type is text/html; charset=ISO-8859-1).
There is exactly one byte > 0x80, a 0xa0 between "Advertising" and "Program". 0xa0 is the Unicode code point for the non-breaking space, but to encode this in UTF-8 you'd need multiple bytes.
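
The difference between the code point and its UTF-8 encoding can be checked directly. A small Python sketch; the helper `decodes_as_utf8` is a name invented for this example:

```python
NBSP = "\u00a0"  # U+00A0 NO-BREAK SPACE

# In ISO-8859-1 the code point and the byte value coincide.
assert NBSP.encode("latin-1") == b"\xa0"

# In UTF-8 the same character takes two bytes.
assert NBSP.encode("utf-8") == b"\xc2\xa0"


def decodes_as_utf8(data: bytes) -> bool:
    """Report whether `data` is valid UTF-8 under strict decoding."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# A lone 0xa0 is a continuation byte with no lead byte, so strict decoding rejects it.
assert not decodes_as_utf8(b"Advertising\xa0Program")
assert decodes_as_utf8(b"Advertising\xc2\xa0Program")
```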

Bottom line: this isn't a bug in pandoc. It's a bug in Google's home page.

Contributor

agusmba commented Feb 22, 2021

While I agree that Google's homepage does not follow W3C recommendations regarding encoding (headers and meta should be consistent), it seems that HTTP headers take precedence, and the headers state ISO-8859-1, so browsers treat Google's home page text as ISO... instead of UTF-8.

Getting encoding right is always a bit messy (as you can see from Google itself)

See also AngleSharp/AngleSharp#295 with a similar issue

Owner

jgm commented Feb 22, 2021

Yes, I know this about browsers. But pandoc only accepts UTF-8 input, as documented. So because the page is not UTF-8, it doesn't work...as expected. You can use curl and iconv in cases like this.

We could actually make this work more smoothly, I guess, by modifying readURI in Text.Pandoc.App.
Right now it's

readURI :: FilePath -> PandocIO Text
readURI src = UTF8.toText . fst <$> openURL (T.pack src)

But since openURL returns a mime type, we could use that to try to determine how to decode the bytestring and use relevant functions in Data.Text.Encoding. If you think this is a good idea, perhaps a new issue should be opened.
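
The idea of choosing a decoder from the reported mime type can be sketched outside pandoc. The following Python is an illustration only, not pandoc's implementation: the function names are invented, falling back to UTF-8 when no charset is given is an assumption, and only UTF-8 and ISO-8859-1 are accepted.

```python
import codecs
from email.message import Message
from typing import Optional

# Canonical Python codec names for the two encodings handled here.
SUPPORTED = {"utf-8", "iso8859-1"}


def charset_from_content_type(content_type: Optional[str]) -> Optional[str]:
    """Extract the charset parameter from a Content-Type header value, if any."""
    if not content_type:
        return None
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def decode_body(body: bytes, content_type: Optional[str]) -> str:
    """Decode a response body according to the charset in its Content-Type."""
    # Assumption: treat a missing charset as UTF-8.
    charset = charset_from_content_type(content_type) or "utf-8"
    try:
        # Normalize aliases such as "latin1" or "ISO-8859-1" to one codec name.
        name = codecs.lookup(charset).name
    except LookupError:
        raise ValueError(f"unknown charset: {charset}") from None
    if name not in SUPPORTED:
        raise ValueError(f"unsupported charset: {charset}")
    return body.decode(name)
```

With the content type curl reported for the Google page, `decode_body(b"Advertising\xa0Program", "text/html; charset=ISO-8859-1")` decodes the 0xa0 byte as a non-breaking space instead of failing.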

jgm added a commit that referenced this issue Feb 22, 2021
the character encoding.  We can properly handle UTF-8 and
latin1 (ISO-8859-1); for others we raise an error.
See #5600.
Owner

jgm commented Feb 22, 2021

I made this change. So now we should be okay with at least latin1 and utf-8 encodings.
