arafix_ocr

Introduction

This tool improves the output of generic OCR systems using an n-gram-based post-correction approach. Most techniques for improving Arabic OCR output focus on the computer vision side of converting images to text. The post-correction module in this tool instead relies entirely on the OCR text and uses no information from the image.

In addition to the post-correction system, this repo contains modules that:

Utilize an external OCR API to convert image to text
Convert images into embedded/searchable PDFs
Evaluate the quality of the system's results at the word level (when the ground truth is known)

Installation Guide

You can view the video guide here or follow the instructions below.

Download SRILM: Navigate to this link and download version 1.7.3 of SRILM into the main directory of this repo
Inside configs/default.txt, add your API key on line 2
Download the models from this link and put them inside the models subfolder. To start off, download only msa_5m.lm
Run the following command:

sh install.sh

This command installs all the dependencies required by arafix. The tool should then be ready to use.

Usage

You can view the video guide here or follow the instructions below.

Arafix has 3 main modules:

image_to_text: converts images to text and searchable PDFs
predict: improves the text generated by the previous module
evaluate: evaluates the word error rate of the outputs of the first and second modules

IMPORTANT NOTE: As of now, the predict module is not ready for use. Thus, all related commands in arafix have been commented out. Given the current functionality, a user can scan an image and evaluate its accuracy but the output cannot be improved using the predict module.

To run arafix, do the following:

Open the data folder and create a subfolder with the name of the book you intend to run arafix on
Within the book's subfolder, create a subfolder named <book_name>_raw_images
Within the <book_name>_raw_images subfolder, add all the images you wish to scan. To start off, you can download pages of a specific ACO book from this link
Optionally, if you intend to perform evaluation of your result (only if you have ground truth), create another subfolder within the <book_name> folder called <book_name>_ground_truth. This folder should contain the ground truth text files for your book (one file for every page).
Open code/arafix.sh in a text editor
Modify variables* as needed
Open a terminal and navigate to arafix_ocr
Now run the following command:

sh arafix.sh <book_name> <names of modules (image_to_text, predict and evaluate) to run separated by space>

For example, to run the image_to_text and evaluate modules on sample_book, the command would be:

sh arafix.sh sample_book image_to_text evaluate

This example is also the recommended command: the image_to_text and evaluate modules work reliably, whereas the predict module might not generate the desired results.
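The folder layout from the first steps of the procedure above can also be created programmatically. The following is a minimal sketch, not part of arafix itself: the helper name create_book_layout and its parameters are hypothetical, and only the folder names (<book_name>_raw_images and <book_name>_ground_truth inside data/<book_name>) come from the steps above.

```python
from pathlib import Path


def create_book_layout(data_dir, book_name, with_ground_truth=False):
    """Create the per-book folders that the usage steps describe.

    Hypothetical helper for illustration; arafix does not ship it.
    Returns the list of folders that now exist for the book.
    """
    book_dir = Path(data_dir) / book_name

    # Folder that holds the page images to scan.
    raw_images = book_dir / f"{book_name}_raw_images"
    raw_images.mkdir(parents=True, exist_ok=True)
    created = [raw_images]

    # Optional folder with one ground-truth text file per page.
    if with_ground_truth:
        ground_truth = book_dir / f"{book_name}_ground_truth"
        ground_truth.mkdir(exist_ok=True)
        created.append(ground_truth)

    return created
```

Calling create_book_layout("data", "sample_book", with_ground_truth=True) from the repository root would prepare both folders; the images and ground-truth files still have to be copied in by hand.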

*arafix.sh variables:

config_name: the config file arafix reads its settings from
start_page: the first page arafix processes. Set it to "None" to start from the lowest available page.
end_page: the last page arafix processes (inclusive). Set it to "None" to run through the highest available page.
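To make the meaning of start_page and end_page concrete, here is a small sketch of how such a page range could be resolved. It is an illustration only: the function select_pages is hypothetical and is not taken from arafix.sh, and it assumes pages are identified by integer numbers.

```python
def select_pages(page_numbers, start_page="None", end_page="None"):
    """Return the pages to process, in ascending order.

    Hypothetical illustration of the start_page / end_page semantics:
    the string "None" means "no bound", and end_page is inclusive.
    """
    pages = sorted(page_numbers)
    if not pages:
        return []

    # "None" falls back to the lowest / highest page that exists.
    first = pages[0] if start_page == "None" else int(start_page)
    last = pages[-1] if end_page == "None" else int(end_page)

    return [page for page in pages if first <= page <= last]
```

With pages 1 to 4 available, select_pages([1, 2, 3, 4], "None", "3") keeps pages 1, 2 and 3, because the end page is included in the range.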

[Optional] Configuration

You can view the video guide here or follow the instructions below.

Arafix runs with the default settings as described in the configs/default.txt file. If you wish to modify these settings, do the following:

Create a copy of default.txt and modify the parameters within this file
Set the config_name variable in arafix.sh to the name of the new config file.

Configuration Parameters:

image_to_text
- api_key: API key obtained from ocr.space, a commercial OCR service
- skip_converted (True/False): if True, files that have already been converted are skipped to save API calls
- create_pdf (True/False): if True, searchable PDFs are created as well
predict
- map_name (see the mappings directory for the available options): different mapping files are used to fix different kinds of errors
- model_name (openiti_5m.lm, openiti_70m.lm, msa_5m.lm, msa_70m.lm): the openiti models are based on Islamic data, while the msa models are based on novels. The 5m versions are faster but less accurate
- order (1-8): the tool performs best at order 8. Lower orders take less time to execute at the cost of reduced accuracy
- keep_scratch (True/False): if False, all scratch files generated during prediction are deleted
- create_pdf (True/False): if True, the errors in the searchable PDF created by the previous module are fixed
evaluate

Set the next 3 parameters to match the results you would like to evaluate (e.g. if you fixed errors using order 8 and the segmenter mapping, select the same values here):
- map_name
- model_name
- order
- keep_scratch (True/False): if False, all scratch files generated during evaluation are deleted

Technical Documentation

Versioning: Python 3.8.3 was used. utils/dependencies.txt and install.sh list the required packages with their respective versions.
Models: The models were built using the ngram-count function in the SRILM toolkit. The following specifications were used:
- Order: 8
- Smoothing: Kneser-Ney
- keep-unk: True
arafix.sh: This bash script is the main entry point. It calls the 3 main modules of the arafix tool. arafix_dalma.sh provides the same code but with Dalma compatibility
image_to_text.py: This module uses the OCR Space API to convert images into text. It also stores relevant JSON info of the OCR'ed files. The settings for the API calls can be modified within ocr_space_func()
predict.py: This module does the following:
- encodes the input text as follows:
  - start of word (ABC -> A#)
  - middle of word (ABC -> #B#)
  - end of word (ABC -> #C)
  - independent letter (A -> A)
  Example: I am an apple -> I a# #m a# #n a# #p# #p# #l# #e
- passes the encoded text to the disambig function of the SRILM toolkit, which determines whether any token needs to be changed based on the tokens that precede it (n-gram)
- decodes the text output by disambig as follows: A# #r# #a# #f# #i# #x O# #C# #R -> Arafix OCR

Note: occasionally, the predicted output for a line will contain an impossible scenario such as: A #l# #a. Here, the 'A' token says that it is completely independent (A la), but the '#l#' that follows it says that it should be connected to the 'A' (Ala). In these cases, the default decoding decision is to split the word.
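The encoding and decoding scheme described above can be sketched in a few lines of Python. This is an illustrative reconstruction from the description, not the code in predict.py: the function names encode and decode are assumptions, characters are treated as single Python code points, and the only conflict rule taken from the text is that an inconsistent pair of tokens splits the word. The sketch applies that same split to every mismatched pair, which is an assumption.

```python
def encode(text):
    """Turn each word into position-marked letter tokens."""
    tokens = []
    for word in text.split():
        if len(word) == 1:
            tokens.append(word)  # independent letter: A -> A
            continue
        tokens.append(word[0] + "#")  # start of word: A#
        tokens.extend("#" + ch + "#" for ch in word[1:-1])  # middle: #B#
        tokens.append("#" + word[-1])  # end of word: #C
    return " ".join(tokens)


def decode(encoded):
    """Rebuild words from letter tokens.

    Two neighbouring tokens are joined only when the left one ends with
    '#' and the right one starts with '#'. Any other combination starts a
    new word, which is the "split" default described in the note above.
    """
    words = []
    prev_open = False  # True when the previous token ended with '#'
    for token in encoded.split():
        joins_left = len(token) > 1 and token.startswith("#")
        joins_right = len(token) > 1 and token.endswith("#")

        letter = token
        if joins_left:
            letter = letter[1:]
        if joins_right:
            letter = letter[:-1]

        if words and prev_open and joins_left:
            words[-1] += letter
        else:
            words.append(letter)
        prev_open = joins_right
    return " ".join(words)
```

With these definitions, encode("I am an apple") reproduces the example string given above, and decode("A #l# #a") yields "A la", matching the split behaviour described in the note.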

evaluate.py: This module uses the ced_word_alignment tool to align the ground truth against the OCR output and the predicted output. It then calculates the word error rate using the following formula: (subs + deletions + insertions) / (subs + deletions + correct words)
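The formula above can be written as a short function. This is a sketch of the arithmetic only: the name word_error_rate and its parameters are hypothetical, and the counts are assumed to come from an alignment such as the one produced by ced_word_alignment.

```python
def word_error_rate(substitutions, deletions, insertions, correct):
    """Word error rate as defined in the formula above.

    The denominator (substitutions + deletions + correct) is the number
    of words in the ground truth, so insertions can push the rate above 1.
    """
    reference_length = substitutions + deletions + correct
    if reference_length == 0:
        raise ValueError("the ground truth must contain at least one word")
    return (substitutions + deletions + insertions) / reference_length
```

For example, 2 substitutions, 1 deletion, 1 insertion and 7 correct words give (2 + 1 + 1) / (2 + 1 + 7) = 0.4.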

Feedback

Report any issues with the tool here
Feel free to contribute to the tool and add new features using pull requests

License

arafix_ocr is available under the Apache License, Version 2.0. See the LICENSE file for more info.
