This mini-project deals with extracting test questions from printed books using OCR (image to text) and a custom parser (text to semantic data, in this case test questions).
It uses the Google Cloud Vision API for OCR. 👀
All code is written in TypeScript.
👉 See also my other project memorio, which uses the results created in this mini-project.
- Node.js >=18.x
- Yarn 1.x
- optional: globally installed nodemon for rerunning scripts on source changes
- Install all dependencies with Yarn (run `yarn`).
There are 4 scripts that implement the full pipeline from a book scan in a PDF to machine-readable data (a collection of questions and categories, including all metadata such as numbering and correct answers).
The scripts were developed specifically for extracting the test questions from the book Modelové otázky z biologie k přijímacím zkouškám na 1. lékařskou fakultu Univerzity Karlovy v Praze, verze 2011 (model questions in biology for the entrance exams to the First Faculty of Medicine of Charles University in Prague, 2011 version). However, they can easily be adapted to other, similar use cases.
Note 1: The input PDFs are NOT published in this repository. However, the example output is published and can be found here.
Note 2: Instead of `nodemon`, you can use `node` directly.
Note 3: If the input PDF is a scanned book where each page contains an image of two real pages (an open book), it is better to manually split the images in the middle (e.g. using the free online service Split two-page layout scans to create separate PDF pages) before running the OCR with the `run-ocr.ts` script.
- `run-ocr.ts`
Calls the Google Cloud Vision API `asyncBatchAnnotate` (see also the official guide).
The PDF (image scan) `{fileName}` must be stored in a GCS bucket `{bucketName}`. The conversion result is a set of JSON files (one file for every 20 pages) that are stored under `{outputPrefix}` in the same bucket.
The script waits until the conversion finishes, and then it prints the output info.
An example:
```sh
nodemon -r ./register.js scripts/run-ocr.ts \
  testbook-ocr \
  test/Modelovky_Biologie_1LF_2011.pdf \
  results/Modelovky_Biologie_1LF_2011
```
The script source code can be found in scripts/run-ocr.ts.
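For illustration, a minimal sketch of the underlying call (not the repository's actual code), assuming the standard `@google-cloud/vision` Node.js client, could look like this:

```ts
// Minimal sketch of the asyncBatchAnnotateFiles call, assuming the official
// @google-cloud/vision Node.js client; the actual run-ocr.ts may differ.
import { ImageAnnotatorClient } from '@google-cloud/vision';

async function runOcr(bucketName: string, fileName: string, outputPrefix: string): Promise<void> {
  const client = new ImageAnnotatorClient();

  // Start the long-running OCR operation on the PDF stored in the GCS bucket.
  const [operation] = await client.asyncBatchAnnotateFiles({
    requests: [{
      inputConfig: {
        gcsSource: { uri: `gs://${bucketName}/${fileName}` },
        mimeType: 'application/pdf',
      },
      features: [{ type: 'DOCUMENT_TEXT_DETECTION' }],
      outputConfig: {
        gcsDestination: { uri: `gs://${bucketName}/${outputPrefix}` },
        batchSize: 20, // one output JSON file per 20 pages
      },
    }],
  });

  // Wait until the conversion finishes, then print the output info.
  const [response] = await operation.promise();
  console.log(JSON.stringify(response, null, 2));
}
```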
- `post-process.ts`
Takes the resulting JSON files from the first script and extracts the text. The input JSON files must be in `{ocrOutputDir}` (on the local filesystem). The output is placed in `{pagesDir}` (on the local filesystem): a set of `page-XXXX.txt` files that contain the text of the corresponding pages.
An example:
```sh
nodemon -r ./register.js -i 'data/' scripts/post-process.ts \
  data/modelovky-biologie-1lf-2011/ocr-output/ \
  data/modelovky-biologie-1lf-2011/pages-original/
```
The script source code can be found in scripts/post-process.ts.
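A rough sketch of this extraction step (assuming the usual layout of Vision API output JSON, where each file contains a `responses` array whose entries carry `fullTextAnnotation.text` and `context.pageNumber`) might look like this:

```ts
// Rough sketch (not the actual post-process.ts) of turning Vision API output JSON
// into per-page text files. Assumes each output file has a `responses` array whose
// entries carry `fullTextAnnotation.text` and `context.pageNumber`.
import { promises as fs } from 'fs';
import { join } from 'path';

async function extractPages(ocrOutputDir: string, pagesDir: string): Promise<void> {
  await fs.mkdir(pagesDir, { recursive: true });

  const files = (await fs.readdir(ocrOutputDir)).filter((f) => f.endsWith('.json')).sort();
  for (const file of files) {
    const batch = JSON.parse(await fs.readFile(join(ocrOutputDir, file), 'utf8'));

    for (const response of batch.responses ?? []) {
      const pageNumber: number = response.context?.pageNumber ?? 0;
      const text: string = response.fullTextAnnotation?.text ?? '';
      const name = `page-${String(pageNumber).padStart(4, '0')}.txt`;
      await fs.writeFile(join(pagesDir, name), text);
    }
  }
}
```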
- `parse-questions.ts`
This script implements a use-case-specific semantic parser that turns the raw text pages into machine-readable data (questions, categories).
It takes the output of the second script (which is in `{pagesDir}`) and creates a collection of JSON files (one `categories.json` plus a per-page `page-XXXX.json` that contains the questions from the corresponding page).
When the parser encounters an unexpected token, it stops and prints detailed information (page and line) about where the error occurred. This allows manual correction of the OCR text output files. The parsing can be rerun as many times as needed (after each correction) until there are no errors and all outputs are created.
An example:
```sh
nodemon -r ./register.js -i 'data/*/questions/' scripts/parse-questions.ts \
  data/modelovky-biologie-1lf-2011/pages/ \
  data/modelovky-biologie-1lf-2011/questions/
```
The script source code can be found in scripts/parse-questions.ts.
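The book's exact grammar is not reproduced here, so the following is only an illustrative sketch of the approach (the regexes and the `Question` shape are hypothetical); the key point is the fail-fast error reporting with page and line:

```ts
// Illustrative sketch of the parsing approach; the regexes and Question shape are
// hypothetical, not the actual grammar used by parse-questions.ts.
interface Question {
  number: number;
  text: string;
  choices: string[];
}

function parsePage(lines: string[], pageName: string): Question[] {
  const questions: Question[] = [];
  let current: Question | undefined;

  lines.forEach((line, index) => {
    const question = line.match(/^(\d+)\.\s+(.*)$/); // e.g. "12. Which organelle ..."
    const choice = line.match(/^([a-e])\)\s+(.*)$/); // e.g. "a) mitochondrion"

    if (question) {
      current = { number: Number(question[1]), text: question[2], choices: [] };
      questions.push(current);
    } else if (choice && current) {
      current.choices.push(choice[2]);
    } else if (current && line.trim() !== '') {
      current.text += ' ' + line.trim(); // wrapped continuation of the question text
    } else if (line.trim() !== '') {
      // Fail fast: report the exact page and line so the OCR text can be fixed by hand.
      throw new Error(`${pageName}, line ${index + 1}: unexpected token: "${line}"`);
    }
  });

  return questions;
}
```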
- `memorio-transform.ts`
Takes the parsed questions and categories from the third script (which are in `{questionsDir}`) and transforms them into the format that can be used in the memorio app.
An example:
```sh
nodemon -r ./register.js -i 'data/*/memorio/' scripts/memorio-transform.ts \
  data/modelovky-biologie-1lf-2011/questions/ \
  data/modelovky-biologie-1lf-2011/memorio/
```
The script source code can be found in scripts/memorio-transform.ts.
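The memorio data format itself is not documented in this repository, so the following sketch only illustrates the reshaping step; the output shape and file name are hypothetical:

```ts
// Sketch only: reads the per-page question JSON files produced by parse-questions.ts
// and writes a single combined file. The output file name ("questions.json") and shape
// are hypothetical; the real memorio format may differ.
import { promises as fs } from 'fs';
import { join } from 'path';

async function transformForMemorio(questionsDir: string, memorioDir: string): Promise<void> {
  await fs.mkdir(memorioDir, { recursive: true });

  const pageFiles = (await fs.readdir(questionsDir))
    .filter((f) => /^page-\d{4}\.json$/.test(f))
    .sort();

  const questions: unknown[] = [];
  for (const file of pageFiles) {
    const page = JSON.parse(await fs.readFile(join(questionsDir, file), 'utf8'));
    questions.push(...page); // assumes each page file holds an array of question objects
  }

  await fs.writeFile(join(memorioDir, 'questions.json'), JSON.stringify(questions, null, '\t'));
}
```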
- What Unicode character is this?
- Unicode Slide Show
- more Unicode tools: https://babelstone.co.uk/Unicode/