pandoc: Cannot decode byte '\xf6': Data.Text.Internal.Encoding.streamDecodeUtf8With: Invalid UTF-8 stream #5600

lzlcfd · 2019-06-18T21:47:52Z

I want to convert the tex file into a docx document using the following command:

pandoc -f latex -t docx -o master.docx --bibliography ./proposal.bib master.tex

but pandoc gave the following error:

pandoc: Cannot decode byte '\xf6': Data.Text.Internal.Encoding.streamDecodeUtf8With: Invalid UTF-8 stream

I am sure that the tex file is saved in UTF-8 format, but the problem still occurs.

wilx · 2019-06-19T05:29:02Z

Can you reduce your file to the smallest possible test case that exhibits the issue?

mb21 · 2019-06-19T07:31:57Z

> I am sure that the tex file is saved in UTF-8 format

Just in case, try: https://pandoc.org/MANUAL.html#character-encoding

lzlcfd · 2019-06-19T18:20:50Z

> I am sure that the tex file is saved in UTF-8 format
>
> Just in case, try: https://pandoc.org/MANUAL.html#character-encoding

Thank you for your reply. I tried the command line as you suggested, but it still failed.

agusmba · 2019-06-19T20:41:48Z

Can you reduce your file to the smallest possible test case that exhibits the issue?

jgm · 2019-06-21T16:54:20Z

You must be using an older version of pandoc, because recent versions will give the byte position of the decoding error; that may help in tracking this down. Your input must not be valid UTF-8.

jgm · 2019-06-21T16:54:38Z

Check the bib file too!
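For readers on a pandoc version that does not report the byte position, one way to test both files is to run them through iconv, which exits with a non-zero status when its input is not valid UTF-8. This is a minimal sketch: `check_utf8` is a hypothetical helper name, and the file names in the usage comments are taken from the command in the original report.

```shell
# Hypothetical helper: report whether a file decodes cleanly as UTF-8.
# iconv exits non-zero when it meets a byte sequence that is not valid UTF-8.
check_utf8() {
  if iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1; then
    echo "$1: valid UTF-8"
  else
    echo "$1: NOT valid UTF-8"
    return 1
  fi
}

# Usage (file names assumed from the original command line):
# check_utf8 master.tex
# check_utf8 proposal.bib
```

Dropping the `2>&1` lets iconv print its own diagnostic, which on common implementations includes the position of the first offending byte.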

mb21 · 2019-06-22T08:33:53Z

Closing until we have a reproducible example...

tranner · 2019-11-13T11:22:34Z

I ran into similar problems and thought I would give some more details for others searching for the same issue.

A (quite) minimal example of the error:

\usepackage{amsaddr}

with amsaddr downloaded from https://www.ctan.org/tex-archive/macros/latex/contrib/amsaddr.

$ pandoc --version
pandoc 2.7.3
Compiled with pandoc-types 1.17.5.4, texmath 0.11.2.2, skylighting 0.8.1
Default user data directory: /localhome/scstr/.local/share/pandoc or /localhome/scstr/.pandoc
Copyright (C) 2006-2019 John MacFarlane
Web:  http://pandoc.org
This is free software; see the source for copying conditions.
There is no warranty, not even for merchantability or fitness
for a particular purpose.

$ echo "\usepackage{amsaddr}" | pandoc -t native
pandoc: Cannot decode byte '\xe9': Data.Text.Internal.Encoding.streamDecodeUtf8With: Invalid UTF-8 stream

The iconv fix suggested at https://pandoc.org/MANUAL.html#character-encoding makes no difference here, because the invalid bytes are in amsaddr.sty rather than in the input passed to pandoc. A possible fix is to edit amsaddr.sty and remove the bad characters from line 9:

- %% Copyright (C) 2006 by J�r�me Lelong <jerome.lelong@gmail.com>
+ %% Copyright (C) 2006 by Lelong <jerome.lelong@gmail.com>

Then we have:

$ echo "\usepackage{amsaddr}" | pandoc -t native
[RawBlock (Format "tex") "\\usepackage{amsaddr}"]

The bad characters are in a comment so perhaps you would hope that this does not cause problems. Either way, this is a possible fix for users.
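An alternative to deleting the characters is to re-encode the style file itself. The sketch below assumes that amsaddr.sty sits in the working directory and is encoded as ISO-8859-1 (the reported byte '\xe9' is consistent with "é" in that encoding, but this is an assumption). `latin1_to_utf8` is a hypothetical helper name.

```shell
# Hypothetical helper: re-encode a file from ISO-8859-1 to UTF-8 in place.
# The original is kept next to it with a .bak suffix.
latin1_to_utf8() {
  cp "$1" "$1.bak" &&
  iconv -f ISO-8859-1 -t UTF-8 "$1.bak" > "$1"
}

# Usage (assumes the file is in the current directory):
# latin1_to_utf8 amsaddr.sty
# echo "\usepackage{amsaddr}" | pandoc -t native
```

If the assumption about the encoding holds, the author's name is preserved and the file becomes valid UTF-8, so the decoding error should no longer be triggered.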

defwxyz · 2021-02-20T08:21:00Z

This may be related to https://serokell.io/blog/haskell-with-utf8

How to reproduce:

pandoc https://www.google.com
pandoc: Cannot decode byte '\xe9': Data.Text.Internal.Encoding.decodeUtf8: Invalid UTF-8 stream

I use the latest version on Debian Linux buster:

pandoc --version
pandoc 2.11.4
Compiled with pandoc-types 1.22, texmath 0.12.1, skylighting 0.10.2,
citeproc 0.3.0.5, ipynb 0.1.0.1

As a workaround, I first fetch the HTML source of the page (from http://www.google.com) with a small program that uses simpleHTTP (getRequest url) from the Network.HTTP Haskell module, and then pass the saved file to pandoc as input.

Done this way, pandoc reads the input without problems and converts it to any format.

jgm · 2021-02-20T18:27:22Z

Well, I looked into this. Despite using a meta tag that says it's UTF-8, the Google page is not UTF-8 encoded, nor is it served with a UTF-8 mime type (curl reports content type is text/html; charset=ISO-8859-1).
There is exactly one byte > 0x80, an 0xa0 between "Advertising" and "Program". 0xa0 is the Unicode code point for nonbreaking space, but to encode this in UTF-8 you'd need two bytes (0xc2 0xa0).

Bottom line: this isn't a bug in pandoc. It's a bug in Google's home page.
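The point about 0xa0 can be checked from a shell. This small sketch converts the single ISO-8859-1 byte 0xa0 to UTF-8 and prints the resulting bytes in hex; `nbsp_utf8_hex` is a hypothetical helper name used only for illustration.

```shell
# Hypothetical helper: encode the Latin-1 byte 0xA0 (octal 240) as UTF-8
# and print the resulting bytes as a hex string.
nbsp_utf8_hex() {
  printf '\240' | iconv -f ISO-8859-1 -t UTF-8 | od -An -tx1 | tr -d ' \n'
}

nbsp_utf8_hex   # prints c2a0
echo
```

A lone 0xa0 byte is a UTF-8 continuation byte with no lead byte before it, which is why a strict UTF-8 decoder rejects the page.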

agusmba · 2021-02-22T10:04:00Z

While I agree that Google's homepage does not follow W3C recommendations regarding encoding (headers and meta should be coherent), it seems that HTTP headers take precedence, and the headers state ISO-8859-1, so browsers treat Google's home page text as ISO-8859-1 instead of UTF-8.

Getting encoding right is always a bit messy (as you can see from Google itself)

See also AngleSharp/AngleSharp#295 with a similar issue

jgm · 2021-02-22T17:45:50Z

Yes, I know this about browsers. But pandoc only accepts UTF-8 input, as documented. So because the page is not UTF-8, it doesn't work...as expected. You can use curl and iconv in cases like this.
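A minimal sketch of the curl and iconv approach, assuming the source charset is already known (for the Google page, curl reported ISO-8859-1). `fetch_as_utf8` is a hypothetical helper name, and the pandoc options in the usage comment are only one possible choice.

```shell
# Hypothetical helper: download a URL and re-encode the body to UTF-8.
# $1 = URL, $2 = source charset (assumed known, e.g. from `curl -sI "$1"`).
fetch_as_utf8() {
  curl -sL "$1" | iconv -f "$2" -t UTF-8
}

# Usage:
# fetch_as_utf8 https://www.google.com ISO-8859-1 | pandoc -f html -t plain
```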

We could actually make this work more smoothly, I guess, by modifying readURI in Text.Pandoc.App.
Right now it's

readURI :: FilePath -> PandocIO Text
readURI src = UTF8.toText . fst <$> openURL (T.pack src)

But since openURL returns a mime type, we could use that to try to determine how to decode the bytestring and use relevant functions in Data.Text.Encoding. If you think this is a good idea, perhaps a new issue should be opened.
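The idea of choosing a decoder from the reported mime type can be illustrated outside pandoc. The following is a shell sketch of the logic only, not pandoc's Haskell implementation: `decode_by_content_type` is a hypothetical helper, and treating a missing charset as UTF-8 is an assumption made here for simplicity.

```shell
# Hypothetical helper: decode stdin to UTF-8 based on a Content-Type value.
# Only UTF-8 and ISO-8859-1 (latin1) are handled; anything else is an error.
# Assumption: a Content-Type without a charset parameter is treated as UTF-8.
decode_by_content_type() {
  charset=$(printf '%s' "$1" | tr 'A-Z' 'a-z' |
    sed -n 's/.*charset=\([^ ;]*\).*/\1/p' | tr -d '"')
  case "$charset" in
    ''|utf-8|utf8)     iconv -f UTF-8 -t UTF-8 ;;        # also validates the input
    iso-8859-1|latin1) iconv -f ISO-8859-1 -t UTF-8 ;;
    *) echo "unsupported charset: $charset" >&2; return 1 ;;
  esac
}

# Usage:
# curl -s https://www.google.com | decode_by_content_type 'text/html; charset=ISO-8859-1'
```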

jgm · 2021-02-22T22:05:20Z

I made this change. So now we should be okay with at least latin1 and utf-8 encodings.

mb21 added the status:more-info-needed label Jun 19, 2019

mb21 closed this as completed Jun 22, 2019

vhquang mentioned this issue Nov 3, 2019

streamDecodeUtf8With: Invalid UTF-8 stream when convert from Markdown to PDF #5872

Closed

jgm reopened this Nov 13, 2019

jgm closed this as completed Nov 13, 2019

jgm added a commit that referenced this issue Feb 22, 2021

When downloading content from URL arguments, be sensitive to...

5a73c5d

the character encoding. We can properly handle UTF-8 and latin1 (ISO-8859-1); for others we raise an error. See #5600.

cderv mentioned this issue Mar 9, 2021

Non-English characters rstudio/distill#137

Closed
