This repository has been archived by the owner on Dec 18, 2019. It is now read-only.
Work directory organization
Jerome Flesch edited this page Nov 9, 2016
workdir|rootdir = ~/papers
In the work directory, you have folders, one per document.
The folder names are (usually) the scan/import date of the document: YYYYMMDD_hhmm_ss[_<idx>]. The suffix 'idx' is optional and is just a number added in case of name collision.
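As a minimal sketch of how such a folder name could be interpreted (the function name `parse_folder_name` is hypothetical and not part of Paperwork), the following Python snippet extracts the date and the optional index. Since names only usually follow the pattern, it returns `None` for anything else.

```python
import re
from datetime import datetime

# YYYYMMDD_hhmm_ss, optionally followed by _<idx> (a number).
FOLDER_RE = re.compile(r"^(\d{8}_\d{4}_\d{2})(?:_(\d+))?$")


def parse_folder_name(name):
    """Return (datetime, idx) for a date-style folder name, else None.

    idx is None when the folder name has no collision suffix.
    """
    match = FOLDER_RE.match(name)
    if match is None:
        return None
    try:
        when = datetime.strptime(match.group(1), "%Y%m%d_%H%M_%S")
    except ValueError:
        # Digits in the right places, but not a valid date/time.
        return None
    idx = int(match.group(2)) if match.group(2) else None
    return when, idx
```

For example, `parse_folder_name("20130505_1518_00")` gives the scan date 2013-05-05 15:18:00 with no index.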
In every folder you have:
- For image documents:
  - paper.<X>.jpg : a page in JPG format (X starts at 1)
  - paper.<X>.words (optional, but required for indexing) : an hOCR file containing all the words the OCR found on the page; it can be regenerated with the "Redo OCR (...)" options
  - paper.<X>.thumb.jpg (optional, generated automatically) : a thumbnail version of the page (faster to load)
  - labels (optional) : a text file containing the labels applied to this document
  - extra.txt (optional) : extra keywords added by the user
- For PDF documents:
  - doc.pdf : the document
  - labels (optional) : a text file containing the labels applied to this document
  - extra.txt (optional) : extra keywords added by the user
  - paper.<X>.words (optional) : an hOCR file containing all the words the OCR found on the page. Some PDFs contain garbage instead of the real text, so running the OCR on them can sometimes be useful.
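The layout above can be checked mechanically. The sketch below is only an illustration, not Paperwork's own code: the function `describe_document` and the keys of the dictionary it returns are hypothetical. It takes the file names found in one document folder and reports which kind of document it is and which pages have an image or an hOCR file.

```python
import re

# Matches paper.<X>.jpg and paper.<X>.words, but not paper.<X>.thumb.jpg.
PAGE_RE = re.compile(r"^paper\.(\d+)\.(jpg|words)$")


def describe_document(filenames):
    """Summarize a document folder from the names of the files it contains."""
    names = set(filenames)
    image_pages, ocr_pages = set(), set()
    for name in names:
        match = PAGE_RE.match(name)
        if match is None:
            continue
        page = int(match.group(1))
        if match.group(2) == "jpg":
            image_pages.add(page)
        else:
            ocr_pages.add(page)

    if "doc.pdf" in names:
        kind = "pdf"
    elif image_pages:
        kind = "image"
    else:
        kind = "unknown"

    return {
        "kind": kind,
        "image_pages": sorted(image_pages),
        "ocr_pages": sorted(ocr_pages),
        "has_labels": "labels" in names,
        "has_extra": "extra.txt" in names,
    }
```

A typical call would be `describe_document(os.listdir(folder))`. Image pages missing from `ocr_pages` are the ones that would not be indexed until the OCR is redone.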
Here is an example of a work directory organisation:
$ find ~/papers
/home/jflesch/papers
/home/jflesch/papers/20130505_1518_00
/home/jflesch/papers/20130505_1518_00/paper.1.jpg
/home/jflesch/papers/20130505_1518_00/paper.1.thumb.jpg
/home/jflesch/papers/20130505_1518_00/paper.1.words
/home/jflesch/papers/20130505_1518_00/paper.2.jpg
/home/jflesch/papers/20130505_1518_00/paper.2.thumb.jpg
/home/jflesch/papers/20130505_1518_00/paper.2.words
/home/jflesch/papers/20130505_1518_00/paper.3.jpg
/home/jflesch/papers/20130505_1518_00/paper.3.thumb.jpg
/home/jflesch/papers/20130505_1518_00/paper.3.words
/home/jflesch/papers/20130505_1518_00/labels
/home/jflesch/papers/20110726_0000_01
/home/jflesch/papers/20110726_0000_01/paper.1.jpg
/home/jflesch/papers/20110726_0000_01/paper.1.thumb.jpg
/home/jflesch/papers/20110726_0000_01/paper.1.words
/home/jflesch/papers/20110726_0000_01/paper.2.jpg
/home/jflesch/papers/20110726_0000_01/paper.2.thumb.jpg
/home/jflesch/papers/20110726_0000_01/paper.2.words
/home/jflesch/papers/20110726_0000_01/extra.txt
/home/jflesch/papers/20130106_1309_44
/home/jflesch/papers/20130106_1309_44/doc.pdf
/home/jflesch/papers/20130106_1309_44/paper.1.words
/home/jflesch/papers/20130106_1309_44/paper.2.words
/home/jflesch/papers/20130106_1309_44/labels
/home/jflesch/papers/20130106_1309_44/extra.txt
With Tesseract, the hOCR file can be obtained with the following command:
tesseract paper.<X>.jpg paper.<X> -l <lang> hocr && mv paper.<X>.html paper.<X>.words
For example:
tesseract paper.1.jpg paper.1 -l fra hocr && mv paper.1.html paper.1.words
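The same command can be scripted. The following is a rough sketch, assuming `tesseract` is on the `PATH`; the helpers `hocr_command` and `run_ocr` are hypothetical names. It also assumes that, depending on the Tesseract version, the output may be written as `paper.<X>.hocr` rather than `paper.<X>.html`, so it looks for both.

```python
import subprocess
from pathlib import Path


def hocr_command(page_number, lang):
    """Build the tesseract argument list for one page, as in the command above."""
    base = f"paper.{page_number}"
    return ["tesseract", f"{base}.jpg", base, "-l", lang, "hocr"]


def run_ocr(doc_dir, page_number, lang):
    """Run tesseract in doc_dir and rename its output to paper.<X>.words."""
    doc_dir = Path(doc_dir)
    subprocess.run(hocr_command(page_number, lang), cwd=doc_dir, check=True)
    target = doc_dir / f"paper.{page_number}.words"
    # Assumption: the output extension depends on the Tesseract version.
    for ext in ("html", "hocr"):
        produced = doc_dir / f"paper.{page_number}.{ext}"
        if produced.exists():
            produced.replace(target)
            return target
    raise FileNotFoundError(f"no hOCR output found for page {page_number}")
```

For instance, `run_ocr("20130505_1518_00", 1, "fra")` would correspond to the example command above, run from the work directory.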
Here is an example of the content of a labels file:
facture,#0000b1588c61
logement,#f6b6ffff0000
Each line is always [label],[color]. A given label should always have the same color in every document.
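A labels file in this format is easy to read back. The sketch below is illustrative only: `parse_labels` and `inconsistent_labels` are hypothetical helpers, and splitting on the last comma (so that a label name could itself contain a comma) is an assumption rather than documented behavior.

```python
def parse_labels(text):
    """Parse the content of a labels file into (label, color) pairs."""
    labels = []
    for line in text.splitlines():
        line = line.strip()
        if not line or "," not in line:
            continue  # skip blank or malformed lines
        # Assumption: the color is whatever follows the last comma.
        name, _, color = line.rpartition(",")
        labels.append((name, color))
    return labels


def inconsistent_labels(documents):
    """Return labels that appear with more than one color.

    documents is an iterable of (label, color) lists, one per document.
    """
    colors = {}
    for labels in documents:
        for name, color in labels:
            colors.setdefault(name, set()).add(color)
    return sorted(name for name, seen in colors.items() if len(seen) > 1)
```

Running `inconsistent_labels` over the parsed labels of every document folder is one way to spot a label whose color differs between documents.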