This project is derived from Microsoft's LayoutLM project, with the dependency on transformers removed. It also includes support for 140 languages. It is a complete project for training and prediction on multilingual documents. Because labelled datasets are limited, please prepare data for your own languages. I have currently tested it on combinations of Hindi, Malayalam, and English, and I have released the training flow and a model for these languages, trained on an Aadhaar dataset.
Path and config file for the multilingual BERT model used to produce embeddings:
https://drive.google.com/drive/folders/1t5Ktz94YTSrE_JHdrfiPc4Moi-K4GxHz?usp=sharing
The training flow is in the train directory.
Go through the parameters in config.yml inside the train directory and alter them to suit your requirements.
- Clone the repository.
- Run pip install -r requirements.txt.
- To train, download the pretrained model from the link above and place it in the models folder, prepare your data in the same format as the annotated_adhaar_data folder, make your changes in the config file, then go to the train folder and run python train.py.
- To predict, alter the config file outside the train folder, put the image path in the parser.py file, and run python parser.py.
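
Before starting a training run, it can help to confirm that the pieces described in the steps above are in place. The following is a minimal Python sketch and is not part of the repository: the helper name missing_paths is hypothetical, and the locations of the models and annotated_adhaar_data folders relative to the repository root are assumptions, so adjust the list to match your checkout.

```python
from pathlib import Path


def missing_paths(repo_root, required):
    """Return the entries of `required` that do not exist under `repo_root`."""
    root = Path(repo_root)
    return [rel for rel in required if not (root / rel).exists()]


# Assumed layout (not confirmed by the repository); edit as needed.
REQUIRED_FOR_TRAINING = [
    "requirements.txt",
    "train/train.py",
    "train/config.yml",
    "models",                 # folder holding the downloaded pretrained model
    "annotated_adhaar_data",  # reference folder for the expected data format
]

if __name__ == "__main__":
    # Run from the repository root.
    missing = missing_paths(".", REQUIRED_FOR_TRAINING)
    if missing:
        print("Missing before training:", ", ".join(missing))
    else:
        print("All expected files and folders are present.")
```

Run it from the repository root. It only checks that the paths exist; it does not validate the contents of config.yml or the format of your data.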