Created ODT documents can contain invalid characters #5119
Comments
Sorry, I don't have the time right now to test whether the bug happens in the newest pandoc release. Here is the minimal example file I used in the first post (embedding markdown into markdown was... not the best idea). |
Your markdown file contains the control character EOT (0x04) right before the word "Host". |
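For anyone who wants to locate such a character in their own input, here is a minimal sketch in Python. It is not part of pandoc; the function name `find_control_chars` and the sample string are made up for illustration, and the set of tolerated characters (tab, newline, carriage return) is an assumption.

```python
def find_control_chars(text, allowed="\t\n\r"):
    """Return (line, column, codepoint) for each C0 control character not in `allowed`."""
    hits = []
    for line_no, line in enumerate(text.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            # C0 controls are U+0000..U+001F; tab, LF and CR are tolerated by default.
            if ord(ch) < 0x20 and ch not in allowed:
                hits.append((line_no, col, f"U+{ord(ch):04X}"))
    return hits


# Hypothetical sample with an EOT (0x04) right before the word "Host".
sample = "MySQL\n\x04Host: example\n"
print(find_control_chars(sample))  # [(2, 1, 'U+0004')]
```

Line and column are 1-based, so the result can be compared directly against what an editor shows.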
Rendering the EOT as |
I suppose we could modify |
Agreed, if we start stripping unknown Unicode characters, we'll always be one step behind adding the newest emojis... |
See related issue #5042 |
This character probably ended up there because the snippet is from an interrupted sqlmap run. It would be nice to have a warning about potentially invalid characters though - I think trying to convert to PDF directly errored out in pdflatex and that's why I tried converting to odt in the first place. Ideally, it would be nice to have a list of characters invalid for each format, but that seems like a lot of effort for little gain. I don't think emoji are likely to break quite as many formats as control characters will. A generic warning for "Contains control characters, might break some formats" shown only when running verbose could work too. It's fair if you want to close this as |
One thing we could do is to emit warnings when we have non-printable / invisible characters in the input stream. We could even add the option to use a custom whitelist, so people can choose whether to get a warning on e.g. When using |
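A rough sketch of what such a warning pass with a user-supplied allowlist might look like, written in Python purely for illustration (pandoc has no such function that I know of). The name `invisible_char_warnings`, the choice of Unicode general categories `Cc` (control) and `Cf` (format) as "invisible", and the default allowlist are all assumptions.

```python
import unicodedata

# Assumption: "non-printable / invisible" means general categories Cc and Cf.
SUSPECT_CATEGORIES = {"Cc", "Cf"}


def invisible_char_warnings(text, allowlist=frozenset("\t\n\r")):
    """Return one warning string per suspect character that is not allowlisted."""
    warnings = []
    for offset, ch in enumerate(text):
        if ch in allowlist:
            continue
        if unicodedata.category(ch) in SUSPECT_CATEGORIES:
            # Control characters have no Unicode name, so a fallback label is supplied.
            name = unicodedata.name(ch, "unnamed control character")
            warnings.append(f"offset {offset}: U+{ord(ch):04X} ({name})")
    return warnings


# A zero-width space warns by default, but not once the user allowlists it.
print(invisible_char_warnings("a\u200bb"))
print(invisible_char_warnings("a\u200bb", allowlist=frozenset("\t\n\r\u200b")))
```

Because the check only reports and never modifies the text, it leaves open the question of whether the character should be passed through or stripped.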
U+0004 isn't a legal XML character anyway, so pandoc's generating ill-formed XML. pandoc definitely ought to be able to detect this case, though I'm not sure whether it should be an error, a warning + passthrough because GIGO, or a warning + stripping. Some formats might place additional restrictions on what characters are allowed and those might be more troublesome to check for, especially if they're not written with future versions of Unicode in mind. I think it would make sense to cross that bridge when we come to it. |
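To make the XML point concrete: the XML 1.0 `Char` production allows only U+0009, U+000A, U+000D, U+0020 to U+D7FF, U+E000 to U+FFFD and U+10000 to U+10FFFF. Below is a small Python sketch of the "warning + stripping" option based on that production. The function names are hypothetical and this is not how pandoc handles it; it only shows that the check is a fixed range test that does not depend on the Unicode version.

```python
def is_xml10_char(ch):
    """True if `ch` matches the Char production of XML 1.0."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def strip_illegal_xml_chars(text):
    """Return (cleaned_text, number_of_characters_removed)."""
    kept = "".join(ch for ch in text if is_xml10_char(ch))
    return kept, len(text) - len(kept)


cleaned, removed = strip_illegal_xml_chars("\x04Host")
print(cleaned, removed)  # Host 1
```

Emoji and other astral characters fall inside the last range, so a filter like this would not need updating for new Unicode releases.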
Copied here for convenience:
I think the easiest thing to do would be to change |
This is the code that made pandoc create an invalid ODT document.
MySQL
It was converted with
When trying to open it in LibreOffice, one is greeted with an error message saying
Read-Error. Format error discovered in the file in sub document content.xml at 34,37.
This is the line found at 34,37:
Pandoc version from Ubuntu 16.04:
I'm including the created corrupted file:
pandoc_test.odt.zip