
Improved document conversion

Louise Crow edited this page Jul 23, 2013 · 10 revisions

Current setup

Office and PDF documents are currently converted to HTML, using wvHtml, xlhtml, pdftohtml, and unrtf. Other formats are not currently supported for inline viewing / indexing.

A document is converted when its "as HTML" link is first followed. The converted text is cached on the filesystem, so subsequent requests for the same link are served directly from the cache.
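The convert-on-first-request behaviour described above can be sketched roughly as follows. This is an illustration only, not Alaveteli's actual code: the function name `cached_html`, the content-hash cache key, and the one-file-per-document layout are all assumptions made for the example.

```python
import hashlib
from pathlib import Path
from typing import Callable


def cached_html(source: bytes, convert: Callable[[bytes], str], cache_dir: Path) -> str:
    """Return the HTML for `source`, running `convert` only on a cache miss.

    Hypothetical helper: the real cache key and directory layout are not
    described on this page, so a SHA-256 of the file content is assumed.
    """
    key = hashlib.sha256(source).hexdigest()
    cache_file = cache_dir / f"{key}.html"

    # Cache hit: serve the stored conversion without touching the converter.
    if cache_file.exists():
        return cache_file.read_text(encoding="utf-8")

    # Cache miss: convert once, store the result, then return it.
    html = convert(source)
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(html, encoding="utf-8")
    return html
```

Passing the converter in as a callable keeps the caching logic independent of which external tool (wvHtml, pdftohtml, and so on) actually does the work.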

The system works fairly well, apart from occasional bugs in the conversion software that cause hangs or empty HTML versions of a document. It doesn't particularly need upgrading; this page just records a possible improvement.
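One common way to contain the two failure modes mentioned above (hangs and empty output) is to run the converter with a timeout and treat empty output as a failure. The sketch below assumes the converter writes its result to standard output; `run_converter` and its 30-second default are hypothetical, and the real command lines for wvHtml, pdftohtml and the others are not shown here.

```python
import subprocess
import sys
from typing import Optional, Sequence


def run_converter(argv: Sequence[str], timeout_s: float = 30.0) -> Optional[str]:
    """Run an external converter and return its output.

    Returns None if the command hangs past the timeout, cannot be started,
    exits with a non-zero status, or produces only whitespace.
    """
    try:
        result = subprocess.run(
            list(argv),
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,  # the child is killed if this expires
            check=False,
        )
    except (subprocess.TimeoutExpired, OSError):
        return None
    if result.returncode != 0:
        return None
    text = result.stdout.decode("utf-8", errors="replace")
    return text if text.strip() else None


if __name__ == "__main__":
    # Stand-in for a real converter: a Python one-liner that prints some HTML.
    print(run_converter([sys.executable, "-c", "print('<p>ok</p>')"]))
```

A caller can then fall back to offering the original download whenever `None` comes back, rather than caching an empty page.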

Problems

  • Conversion software bugs leading to corrupt data
  • Conversions are not pretty (e.g. images in documents are not preserved nicely -- see samples below)
  • Quite limited range of supported source formats (but 99% of the documents we receive are in supported formats, i.e. doc and pdf)
  • No ability to annotate inline

Alternative

There's an alternative system used by US FOI site MuckRock, which displays the documents in a nice viewer.

Their system uses documentcloud.org, a source document system for journalists. The software behind it is open source and available at https://github.com/documentcloud. Alternatively, we could use the hosted DocumentCloud service, which (currently, at least) is free.

See also the discussion on the Alaveteli dev mailing list

The main components are:

  • docsplit, a Ruby frontend for OpenOffice (document conversion), Tesseract (OCR), pdftk (splitting a single PDF into one file per page), and GraphicsMagick (thumbnails/images of pages)
  • DocumentViewer from NYT. Most importantly, supports annotations on the document (e.g. this senate bill)
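As a rough illustration of how the docsplit stage might be driven, the sketch below only builds the command lines for the three steps (convert to PDF, extract text, render page images) and does not run them. The `pdf`, `text` and `images` subcommands and the `--output`, `--size` and `--format` options are taken from docsplit's documented command-line interface as best recalled and should be checked against the installed version; `docsplit_commands`, the `700x` size and the assumption that the converted PDF keeps the source file's base name are illustrative.

```python
from pathlib import Path
from typing import List


def docsplit_commands(source: Path, out_dir: Path) -> List[List[str]]:
    """Build (but do not run) docsplit command lines for one document.

    Assumption: `docsplit pdf` writes `<stem>.pdf` into the output directory.
    """
    pdf = out_dir / (source.stem + ".pdf")
    return [
        # 1. Convert the office document to PDF (via OpenOffice).
        ["docsplit", "pdf", str(source), "--output", str(out_dir)],
        # 2. Extract text from the PDF (docsplit can fall back to OCR).
        ["docsplit", "text", str(pdf), "--output", str(out_dir)],
        # 3. Render page images, e.g. for thumbnails in a request thread.
        ["docsplit", "images", str(pdf), "--size", "700x",
         "--format", "png", "--output", str(out_dir)],
    ]


if __name__ == "__main__":
    for cmd in docsplit_commands(Path("letter.doc"), Path("converted")):
        print(" ".join(cmd))
```

Each list could be handed to a subprocess call with a timeout, since the PDF step depends on a headless OpenOffice being available.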

Benefits:

  • Much nicer-looking conversions
  • Reasonably good interface for navigating around documents
  • OCRed text wherever text extraction is not possible
  • Support for every format that OpenOffice can open
  • All documents converted to PDF as part of process
  • Annotations possible
  • Could use the DocumentCloud service, thus dramatically reducing maintenance and hosting overheads
  • Thumbnails of documents suitable for including in request thread (see below)

Presentation of a document as a thumbnail within a request thread, with download links below, and a summary next to it

Comparison of OpenOffice (left) and wvHtml conversions

Drawbacks:

  • Requires new code
  • If we implement locally, likely to be higher processing overheads (needs a running headless OpenOffice, and always requires PDF extraction step)
  • Is it indexable by search engines? The NYT blog post promises to fix this.