Created ODT documents can contain invalid characters #5119

pshem · 2018-12-02T16:52:14Z

This is the code that made pandoc create an invalid ODT document.

MySQL

root@kali:~# hydra -e nsr -l root -P /usr/share/wordlists/metasploit/unix_passwords.txt mysql://192.168.1.10/ -I
Hydra v8.6 (c) 2017 by van Hauser/THC - Please do not use in military or secret service organizations, or for illegal purposes.

Hydra (http://www.thc.org/thc-hydra) starting at 2018-11-21 12:44:18
[INFO] Reduced number of tasks to 4 (mysql does not like many parallel connections)
[DATA] max 4 tasks per 1 server, overall 4 tasks, 1012 login tries (l:1/p:1012), ~253 tries per task
[DATA] attacking mysql://192.168.1.10:3306/
[ERROR] �Host '192.168.1.200' is not allowed to connect to this MySQL server

It was converted with

pshem@Computer /path $ pandoc -s -o pandoc_test.odt -f markdown_github -t odt pandoc_test.md

When trying to open it in LibreOffice, one is greeted with an error message saying Read-Error. Format error discovered in the file in sub document content.xml at 34,37. This is the line found at 34,37:

<text:p text:style-name="P8">[ERROR] �Host '192.168.1.200' is not allowed to connect to this MySQL server</text:p>

Pandoc version from Ubuntu 16.04:

pandoc -v
pandoc 1.16.0.2
Compiled with texmath 0.8.4.1, highlighting-kate 0.6.1.
Syntax highlighting is supported for the following languages:
    abc, actionscript, ada, agda, apache, asn1, asp, awk, bash, bibtex, boo, c,
    changelog, clojure, cmake, coffee, coldfusion, commonlisp, cpp, cs, css,
    curry, d, diff, djangotemplate, dockerfile, dot, doxygen, doxygenlua, dtd,
    eiffel, email, erlang, fasm, fortran, fsharp, gcc, glsl, gnuassembler, go,
    haskell, haxe, html, idris, ini, isocpp, java, javadoc, javascript, json,
    jsp, julia, kotlin, latex, lex, lilypond, literatecurry, literatehaskell,
    llvm, lua, m4, makefile, mandoc, markdown, mathematica, matlab, maxima,
    mediawiki, metafont, mips, modelines, modula2, modula3, monobasic, nasm,
    noweb, objectivec, objectivecpp, ocaml, octave, opencl, pascal, perl, php,
    pike, postscript, prolog, pure, python, r, relaxng, relaxngcompact, rest,
    rhtml, roff, ruby, rust, scala, scheme, sci, sed, sgml, sql, sqlmysql,
    sqlpostgresql, tcl, tcsh, texinfo, verilog, vhdl, xml, xorg, xslt, xul,
    yacc, yaml, zsh
Default user data directory: /home/pshem/.pandoc
Copyright (C) 2006-2015 John MacFarlane
Web:  http://pandoc.org
This is free software; see the source for copying conditions.
There is no warranty, not even for merchantability or fitness
for a particular purpose.

I'm including the created corrupted file:
pandoc_test.odt.zip

The text was updated successfully, but these errors were encountered:

mb21 · 2018-12-02T17:01:24Z

Use the newest pandoc version from https://github.com/jgm/pandoc/releases
Could you give us the actual input file? Best as a minimal example.

pshem · 2018-12-02T17:04:56Z

Sorry, I don't have the time to test if the bug happens in the newest pandoc release right now. Here is the minimal example file I used in the first post( embeding markdown into markdown was... not the best idea)
pandoc_test.md.zip

jgm · 2018-12-02T17:13:05Z

Your markdown file contains the control character EOT (0x04) right before the word "Host".
I don't know why. This is being passed through to the ODT, which apparently doesn't like it.
(verified with latest version)

jgm · 2018-12-02T17:15:19Z

Rendering the EOT as  makes no difference, still invalid. This character (and presumably other control characters) should be stripped.

jgm · 2018-12-02T17:19:15Z

I suppose we could modify escapeStringForXML so that it strips this character (and which others)? It may be, though, that the character is allowed in some XML formats. I'm inclined to do nothing, actually, since there's no clear reason to have an EOT character in a Markdown file; this is garbage in, garbage out, no?

mb21 · 2018-12-02T17:23:44Z

Agreed, if we start stripping unknown Unicode characters, we'll always be one step behind adding the newest emojis...

jgm · 2018-12-02T17:36:17Z

See related issue #5042

pshem · 2018-12-02T23:26:51Z

This character probably came to be because the snippet is from an interupted sqlmap run. It would be nice to have a warning about potentially invalid characters though - I think trying to convert to PDF directly errored out in pdflatex and that's why I tried converting to odt in the first place.

Ideally, it would be nice to have a list of characters invalid for each format, but that seems like a lot of effort for little gain. I don't think emoji are likely to break quite as many formats as control characters will. A generic warning for "Contains control characters, might break some formats" shown only when running verbose could work too. It's fair if you want to close this as won't-fix

mb21 · 2018-12-03T08:18:14Z

One thing we could do is to emit warnings when we have non-printable / invisible characters in the input stream. We could even add the option to use a custom whitelist, so people can choose whether to get a warning on e.g. \r or not.

When using --pdf-engine=pdflatex, the whitelist could be restricted to ASCII. And broadened when using XeLaTeX.

quasicomputational · 2018-12-03T17:32:57Z

U+0004 isn't a legal XML character anyway, so pandoc's generating ill-formed XML. pandoc definitely ought to be able to detect this case, though I'm not sure whether it should be an error, a warning + passthrough because GIGO, or a warning + stripping.

Here's the spec on what characters are allowed: note that even presently-unassigned codepoints are legal, so random new emoji getting added isn't a problem.

Some formats might place additional restrictions on what characters are allowed and those might be more troublesome to check for, especially if they're not written with future versions of Unicode in mind. I think it would make sense to cross that bridge when we come to it.

jgm · 2018-12-04T17:15:47Z

Copied here for convenience:

Char | ::= | #x9 \| #xA \| #xD \| [#x20-#xD7FF] \| [#xE000-#xFFFD] \| [#x10000-#x10FFFF]

I think the easiest thing to do would be to change escapeStringForXML in Text.Pandoc.XML so that it silently strips out nonconforming characters. If we wanted to add a warning, we could, but this would require completely changing the type of escapeStringForXML and revising a lot of other code.

Closes jgm#5119.

pshem changed the title ~~Created ODT documetns can contain invalid characters~~ Created ODT documents can contain invalid characters Dec 2, 2018

jgm closed this as completed in 38200c0 Dec 4, 2018

daamien pushed a commit to daamien/pandoc that referenced this issue Dec 27, 2018

Strip out illegal XML characters in escapeXMLString.

f2a57e6

Closes jgm#5119.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Created ODT documents can contain invalid characters #5119

Created ODT documents can contain invalid characters #5119

pshem commented Dec 2, 2018

mb21 commented Dec 2, 2018

pshem commented Dec 2, 2018

jgm commented Dec 2, 2018 •

edited

Loading

jgm commented Dec 2, 2018

jgm commented Dec 2, 2018

mb21 commented Dec 2, 2018

jgm commented Dec 2, 2018

pshem commented Dec 2, 2018

mb21 commented Dec 3, 2018 •

edited

Loading

quasicomputational commented Dec 3, 2018

jgm commented Dec 4, 2018

Created ODT documents can contain invalid characters #5119

Created ODT documents can contain invalid characters #5119

Comments

pshem commented Dec 2, 2018

MySQL

mb21 commented Dec 2, 2018

pshem commented Dec 2, 2018

jgm commented Dec 2, 2018 • edited Loading

jgm commented Dec 2, 2018

jgm commented Dec 2, 2018

mb21 commented Dec 2, 2018

jgm commented Dec 2, 2018

pshem commented Dec 2, 2018

mb21 commented Dec 3, 2018 • edited Loading

quasicomputational commented Dec 3, 2018

jgm commented Dec 4, 2018

jgm commented Dec 2, 2018 •

edited

Loading

mb21 commented Dec 3, 2018 •

edited

Loading