Work directory organization

workdir|rootdir = ~/papers

Global organisation

The work directory contains one folder per document.

Folder names are (usually) the scan/import date of the document, in the form YYYYMMDD_hhmm_ss[_<idx>]. The optional suffix 'idx' is a number that is only added when two folder names would otherwise collide.
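As an illustration of the naming scheme, here is a minimal Python sketch that parses such a folder name. The function name parse_folder_name and the choice to return None for names that do not follow the pattern are assumptions made for this example, not part of the application.

```python
import re
from datetime import datetime

# YYYYMMDD_hhmm_ss, optionally followed by _<idx>
FOLDER_RE = re.compile(r"(\d{8}_\d{4}_\d{2})(?:_(\d+))?")


def parse_folder_name(name):
    """Return (datetime, idx) for a date-based folder name, or None otherwise.

    idx is None when the folder name has no collision suffix.
    (Hypothetical helper, written for illustration only.)
    """
    match = FOLDER_RE.fullmatch(name)
    if match is None:
        return None
    try:
        when = datetime.strptime(match.group(1), "%Y%m%d_%H%M_%S")
    except ValueError:
        # Right shape, but not a valid date or time (e.g. month 13)
        return None
    idx = int(match.group(2)) if match.group(2) else None
    return when, idx
```

Since folder names are only "usually" dates, a caller should be prepared for the None case rather than treat it as an error.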

Each folder contains:

For image documents:
- paper.<X>.jpg : A page in JPG format (X starts at 1)
- paper.<X>.words (optional, but required for indexing) : A hOCR file containing all the words found on the page by the OCR. It can be regenerated with the options "Redo OCR (...)".
- paper.<X>.thumb.jpg (optional, generated automatically) : A thumbnail version of the page (faster to load)
- labels (optional) : a text file containing the labels applied on this document
- extra.txt (optional) : extra keywords added by the user
For PDF documents:
- doc.pdf : the document
- labels (optional) : a text file containing the labels applied on this document
- extra.txt (optional) : extra keywords added by the user
- paper.<X>.words (optional) : A hOCR file containing all the words found on the page by the OCR. Some PDFs contain garbage instead of the real text, so running the OCR on them can sometimes be useful.
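
The layout above can be checked mechanically. The following Python sketch classifies a document folder from the list of its file names. The function name describe_document and the keys of the returned dictionary are hypothetical and chosen only for this example.

```python
import re

# Matches paper.<X>.jpg but not paper.<X>.thumb.jpg
PAGE_RE = re.compile(r"paper\.(\d+)\.jpg")


def describe_document(names):
    """Summarise one document folder, given the names of the files in it.

    (Hypothetical helper: it only reflects the layout described above.)
    """
    names = set(names)
    pages = sorted(
        int(m.group(1)) for m in map(PAGE_RE.fullmatch, names) if m
    )
    if "doc.pdf" in names:
        kind = "pdf"
    elif pages:
        kind = "image"
    else:
        kind = None  # neither layout recognised
    return {
        "kind": kind,
        "pages": pages,  # page numbers of image documents, starting at 1
        "has_labels": "labels" in names,
        "has_extra": "extra.txt" in names,
        # Image pages that cannot be indexed until the OCR is redone
        "pages_without_ocr": [
            x for x in pages if "paper.%d.words" % x not in names
        ],
    }
```

For a folder on disk, the input could be obtained with os.listdir(folder).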

Here is an example of a work directory organisation:

$ find ~/papers
/home/jflesch/papers
/home/jflesch/papers/20130505_1518_00
/home/jflesch/papers/20130505_1518_00/paper.1.jpg
/home/jflesch/papers/20130505_1518_00/paper.1.thumb.jpg
/home/jflesch/papers/20130505_1518_00/paper.1.words
/home/jflesch/papers/20130505_1518_00/paper.2.jpg
/home/jflesch/papers/20130505_1518_00/paper.2.thumb.jpg
/home/jflesch/papers/20130505_1518_00/paper.2.words
/home/jflesch/papers/20130505_1518_00/paper.3.jpg
/home/jflesch/papers/20130505_1518_00/paper.3.thumb.jpg
/home/jflesch/papers/20130505_1518_00/paper.3.words
/home/jflesch/papers/20130505_1518_00/labels
/home/jflesch/papers/20110726_0000_01
/home/jflesch/papers/20110726_0000_01/paper.1.jpg
/home/jflesch/papers/20110726_0000_01/paper.1.thumb.jpg
/home/jflesch/papers/20110726_0000_01/paper.1.words
/home/jflesch/papers/20110726_0000_01/paper.2.jpg
/home/jflesch/papers/20110726_0000_01/paper.2.thumb.jpg
/home/jflesch/papers/20110726_0000_01/paper.2.words
/home/jflesch/papers/20110726_0000_01/extra.txt
/home/jflesch/papers/20130106_1309_44
/home/jflesch/papers/20130106_1309_44/doc.pdf
/home/jflesch/papers/20130106_1309_44/paper.1.words
/home/jflesch/papers/20130106_1309_44/paper.2.words
/home/jflesch/papers/20130106_1309_44/labels
/home/jflesch/papers/20130106_1309_44/extra.txt

hOCR files

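With Tesseract, the hOCR file can be obtained with the following command (older Tesseract releases write paper.<X>.html; more recent releases write paper.<X>.hocr, in which case the source of the mv must be adjusted):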

tesseract paper.<X>.jpg paper.<X> -l <lang> hocr && mv paper.<X>.html paper.<X>.words

For example:

tesseract paper.1.jpg paper.1 -l fra hocr && mv paper.1.html paper.1.words
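
To show what a .words file holds, here is a minimal Python sketch that pulls the recognised words out of hOCR markup using only the standard library. It assumes that each word is wrapped in an element whose class includes ocrx_word, which is what Tesseract's hOCR output normally uses. The names HocrWords and hocr_words are made up for this example.

```python
from html.parser import HTMLParser


class HocrWords(HTMLParser):
    """Collect the text of every element whose class includes 'ocrx_word'."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._depth = 0    # > 0 while inside a word element
        self._chunks = []  # text pieces of the word being read

    def handle_starttag(self, tag, attrs):
        if self._depth:
            # Nested markup inside a word, such as <strong> or <em>
            self._depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        if "ocrx_word" in classes:
            self._depth = 1
            self._chunks = []

    def handle_endtag(self, tag):
        if self._depth:
            self._depth -= 1
            if self._depth == 0:
                text = "".join(self._chunks).strip()
                if text:
                    self.words.append(text)

    def handle_data(self, data):
        if self._depth:
            self._chunks.append(data)


def hocr_words(markup):
    """Return the list of words found in a hOCR document given as a string."""
    parser = HocrWords()
    parser.feed(markup)
    parser.close()
    return parser.words
```

A page could then be read with hocr_words(open("paper.1.words", encoding="utf-8").read()), assuming the file is UTF-8 encoded.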

Label files

Here is an example of the content of a label file:

facture,#0000b1588c61
logement,#f6b6ffff0000

Each line has the form [label],[color]. A given label should always use the same color.
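
Reading such a file takes only a few lines. The Python sketch below turns the content of a label file into a dictionary. The function name parse_labels is hypothetical, and splitting on the last comma (so that a label could itself contain commas) is an assumption, since the format above does not say how such labels are handled.

```python
def parse_labels(text):
    """Parse the content of a label file into a {label: color} dictionary.

    (Hypothetical helper, written for illustration only.)
    """
    labels = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore empty lines
        # Assumption: the color is whatever follows the last comma
        name, sep, color = line.rpartition(",")
        if not sep:
            raise ValueError("not a [label],[color] line: %r" % line)
        labels[name] = color
    return labels
```

Applied to the example above, it maps facture to #0000b1588c61 and logement to #f6b6ffff0000.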
