Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Created ODT documents can contain invalid characters #5119

Closed
pshem opened this issue Dec 2, 2018 · 11 comments
Closed

Created ODT documents can contain invalid characters #5119

pshem opened this issue Dec 2, 2018 · 11 comments

Comments

@pshem
Copy link

pshem commented Dec 2, 2018

This is the code that made pandoc create an invalid ODT document.

MySQL

root@kali:~# hydra -e nsr -l root -P /usr/share/wordlists/metasploit/unix_passwords.txt mysql://192.168.1.10/ -I
Hydra v8.6 (c) 2017 by van Hauser/THC - Please do not use in military or secret service organizations, or for illegal purposes.

Hydra (http://www.thc.org/thc-hydra) starting at 2018-11-21 12:44:18
[INFO] Reduced number of tasks to 4 (mysql does not like many parallel connections)
[DATA] max 4 tasks per 1 server, overall 4 tasks, 1012 login tries (l:1/p:1012), ~253 tries per task
[DATA] attacking mysql://192.168.1.10:3306/
[ERROR] �Host '192.168.1.200' is not allowed to connect to this MySQL server

It was converted with

pshem@Computer /path $ pandoc -s -o pandoc_test.odt -f markdown_github -t odt pandoc_test.md

When trying to open it in LibreOffice, one is greeted with an error message saying Read-Error. Format error discovered in the file in sub document content.xml at 34,37. This is the line found at 34,37:

<text:p text:style-name="P8">[ERROR] �Host '192.168.1.200' is not allowed to connect to this MySQL server</text:p>

Pandoc version from Ubuntu 16.04:

pandoc -v
pandoc 1.16.0.2
Compiled with texmath 0.8.4.1, highlighting-kate 0.6.1.
Syntax highlighting is supported for the following languages:
    abc, actionscript, ada, agda, apache, asn1, asp, awk, bash, bibtex, boo, c,
    changelog, clojure, cmake, coffee, coldfusion, commonlisp, cpp, cs, css,
    curry, d, diff, djangotemplate, dockerfile, dot, doxygen, doxygenlua, dtd,
    eiffel, email, erlang, fasm, fortran, fsharp, gcc, glsl, gnuassembler, go,
    haskell, haxe, html, idris, ini, isocpp, java, javadoc, javascript, json,
    jsp, julia, kotlin, latex, lex, lilypond, literatecurry, literatehaskell,
    llvm, lua, m4, makefile, mandoc, markdown, mathematica, matlab, maxima,
    mediawiki, metafont, mips, modelines, modula2, modula3, monobasic, nasm,
    noweb, objectivec, objectivecpp, ocaml, octave, opencl, pascal, perl, php,
    pike, postscript, prolog, pure, python, r, relaxng, relaxngcompact, rest,
    rhtml, roff, ruby, rust, scala, scheme, sci, sed, sgml, sql, sqlmysql,
    sqlpostgresql, tcl, tcsh, texinfo, verilog, vhdl, xml, xorg, xslt, xul,
    yacc, yaml, zsh
Default user data directory: /home/pshem/.pandoc
Copyright (C) 2006-2015 John MacFarlane
Web:  http://pandoc.org
This is free software; see the source for copying conditions.
There is no warranty, not even for merchantability or fitness
for a particular purpose.

I'm including the created corrupted file:
pandoc_test.odt.zip

@pshem pshem changed the title Created ODT documetns can contain invalid characters Created ODT documents can contain invalid characters Dec 2, 2018
@mb21
Copy link
Collaborator

mb21 commented Dec 2, 2018

  1. Use the newest pandoc version from https://github.com/jgm/pandoc/releases
  2. Could you give us the actual input file? Best as a minimal example.

@pshem
Copy link
Author

pshem commented Dec 2, 2018

Sorry, I don't have the time to test if the bug happens in the newest pandoc release right now. Here is the minimal example file I used in the first post( embeding markdown into markdown was... not the best idea)
pandoc_test.md.zip

@jgm
Copy link
Owner

jgm commented Dec 2, 2018

Your markdown file contains the control character EOT (0x04) right before the word "Host".
I don't know why. This is being passed through to the ODT, which apparently doesn't like it.
(verified with latest version)

@jgm
Copy link
Owner

jgm commented Dec 2, 2018

Rendering the EOT as &#4; makes no difference, still invalid. This character (and presumably other control characters) should be stripped.

@jgm
Copy link
Owner

jgm commented Dec 2, 2018

I suppose we could modify escapeStringForXML so that it strips this character (and which others)? It may be, though, that the character is allowed in some XML formats. I'm inclined to do nothing, actually, since there's no clear reason to have an EOT character in a Markdown file; this is garbage in, garbage out, no?

@mb21
Copy link
Collaborator

mb21 commented Dec 2, 2018

Agreed, if we start stripping unknown Unicode characters, we'll always be one step behind adding the newest emojis...

@jgm
Copy link
Owner

jgm commented Dec 2, 2018

See related issue #5042

@pshem
Copy link
Author

pshem commented Dec 2, 2018

This character probably came to be because the snippet is from an interupted sqlmap run. It would be nice to have a warning about potentially invalid characters though - I think trying to convert to PDF directly errored out in pdflatex and that's why I tried converting to odt in the first place.

Ideally, it would be nice to have a list of characters invalid for each format, but that seems like a lot of effort for little gain. I don't think emoji are likely to break quite as many formats as control characters will. A generic warning for "Contains control characters, might break some formats" shown only when running verbose could work too. It's fair if you want to close this as won't-fix

@mb21
Copy link
Collaborator

mb21 commented Dec 3, 2018

One thing we could do is to emit warnings when we have non-printable / invisible characters in the input stream. We could even add the option to use a custom whitelist, so people can choose whether to get a warning on e.g. \r or not.

When using --pdf-engine=pdflatex, the whitelist could be restricted to ASCII. And broadened when using XeLaTeX.

@quasicomputational
Copy link
Contributor

U+0004 isn't a legal XML character anyway, so pandoc's generating ill-formed XML. pandoc definitely ought to be able to detect this case, though I'm not sure whether it should be an error, a warning + passthrough because GIGO, or a warning + stripping.

Here's the spec on what characters are allowed: note that even presently-unassigned codepoints are legal, so random new emoji getting added isn't a problem.

Some formats might place additional restrictions on what characters are allowed and those might be more troublesome to check for, especially if they're not written with future versions of Unicode in mind. I think it would make sense to cross that bridge when we come to it.

@jgm
Copy link
Owner

jgm commented Dec 4, 2018

Copied here for convenience:

Char | ::= | #x9 \| #xA \| #xD \| [#x20-#xD7FF] \| [#xE000-#xFFFD] \| [#x10000-#x10FFFF]

I think the easiest thing to do would be to change escapeStringForXML in Text.Pandoc.XML so that it silently strips out nonconforming characters. If we wanted to add a warning, we could, but this would require completely changing the type of escapeStringForXML and revising a lot of other code.

@jgm jgm closed this as completed in 38200c0 Dec 4, 2018
daamien pushed a commit to daamien/pandoc that referenced this issue Dec 27, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants