pandoc: Cannot decode byte '\xf6': Data.Text.Internal.Encoding.streamDecodeUtf8With: Invalid UTF-8 stream #5600
Comments
Can you reduce your file to the smallest possible test case that exhibits the issue?
Just in case, try: https://pandoc.org/MANUAL.html#character-encoding
Thank you for your reply. I tried the command line as you suggested, but it still failed.
You must be using an older version of pandoc, because recent versions will give the byte position of the decoding error; that may help in tracking this down. Your input must not be valid UTF-8.
Check the bib file too!
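For older pandoc versions that do not report the byte position, the offending byte can be located by hand. The following is a minimal Python sketch, not part of pandoc; the function name `first_invalid_utf8` is hypothetical. It returns the offset and value of the first byte that is not valid UTF-8, or `None` if the data decodes cleanly.

```python
def first_invalid_utf8(data: bytes):
    """Return (offset, byte value) of the first invalid UTF-8 byte, or None.

    Hypothetical helper for illustration; it is not a pandoc API.
    """
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start is the index of the first byte that could not be decoded.
        return err.start, data[err.start]
    return None
```

To check the files from this thread, one could run it on each input in turn, for example `first_invalid_utf8(open("master.tex", "rb").read())` and then the same for `proposal.bib`.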
Closing until we have a reproducible example...
I ran into similar problems and thought I would give some more details for others searching for the same issue. A (quite) minimal example of the error: \usepackage{amsaddr} with
Clearly the iconv fix suggested at https://pandoc.org/MANUAL.html#character-encoding doesn't make any difference here, but a possible fix is to look into amsaddr.sty and remove the bad characters from line 9:
- %% Copyright (C) 2006 by J�r�me Lelong <jerome.lelong@gmail.com>
+ %% Copyright (C) 2006 by Lelong <jerome.lelong@gmail.com>
Then we have:
The bad characters are in a comment, so you might hope that they would not cause problems. Either way, this is a possible fix for users.
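An alternative to deleting the characters is to re-encode the file, which keeps the author's name intact. The sketch below assumes the file is Latin-1 (ISO-8859-1) encoded, which the thread does not confirm; the helper name `latin1_to_utf8` is hypothetical.

```python
def latin1_to_utf8(data: bytes) -> bytes:
    """Re-encode bytes from Latin-1 to UTF-8.

    Assumption: the input really is Latin-1. Every byte sequence decodes
    as Latin-1 without error, so a wrong guess silently produces wrong
    characters rather than an exception.
    """
    return data.decode("latin-1").encode("utf-8")
```

Applied to a local copy of the style file, this would mean reading it in binary mode, passing the bytes through `latin1_to_utf8`, and writing the result back in binary mode.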
This may be related to https://serokell.io/blog/haskell-with-utf8 How to reproduce:
I use the latest version on Debian Linux buster.
I do this as a workaround:
With this, there is no problem reading the input and converting it to any format.
Well, I looked into this. Despite using a meta tag that says it's UTF-8, the Google page is not UTF-8 encoded, nor is it served with a UTF-8 MIME type (curl reports content type is
Bottom line: this isn't a bug in pandoc. It's a bug in Google's home page.
While I agree that Google's homepage does not follow W3C recommendations regarding encoding (headers and meta should be coherent), it seems that HTTP headers take precedence, and the headers state
Getting encoding right is always a bit messy (as you can see from Google itself). See also AngleSharp/AngleSharp#295 for a similar issue.
Yes, I know this about browsers. But pandoc only accepts UTF-8 input, as documented. So because the page is not UTF-8, it doesn't work...as expected. You can use
We could actually make this work more smoothly, I guess, by modifying
readURI :: FilePath -> PandocIO Text
readURI src = UTF8.toText . fst <$> openURL (T.pack src)
But since openURL returns a MIME type, we could use that to try to determine how to decode the bytestring and use relevant functions in Data.Text.Encoding. If you think this is a good idea, perhaps a new issue should be opened.
the character encoding. We can properly handle UTF-8 and latin1 (ISO-8859-1); for others we raise an error. See #5600.
I made this change. So now we should be okay with at least latin1 and UTF-8 encodings.
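The behavior described above (choose the decoder from the charset in the MIME type, handle UTF-8 and latin1, raise an error otherwise) can be illustrated with a short Python sketch. This is not pandoc's Haskell implementation; the function name `decode_body` and the choice to default to UTF-8 when no charset is given are assumptions made for the example.

```python
from email.message import Message


def decode_body(body: bytes, content_type: str) -> str:
    """Decode a fetched body according to the charset in its Content-Type.

    Illustrative only: handles UTF-8 and Latin-1, raises for anything else.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the charset parameter lowercased, or None.
    # Assumption: treat a missing charset as UTF-8.
    charset = msg.get_content_charset() or "utf-8"
    if charset in ("utf-8", "utf8"):
        return body.decode("utf-8")
    if charset in ("iso-8859-1", "latin1", "latin-1"):
        return body.decode("latin-1")
    raise ValueError(f"unsupported charset: {charset}")
```

With this sketch, a body served as `text/html; charset=ISO-8859-1` decodes without error even when it contains a byte such as `\xf6`, while the same bytes labelled as UTF-8 still fail.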
I want to convert the tex file into a docx document using the following command:
pandoc -f latex -t docx -o master.docx --bibliography ./proposal.bib master.tex
but pandoc gave the following error:
pandoc: Cannot decode byte '\xf6': Data.Text.Internal.Encoding.streamDecodeUtf8With: Invalid UTF-8 stream
I am sure that the tex file is saved in UTF-8 format, but the problem still occurred.