Historical documents can reveal a great deal about our past, such as forms of writing, wording, content that no longer exists, and more. Machine learning requires a large amount of labeled (classified) data. Creating labeled data (annotation) is expensive and tedious work, so the datasets available for training models on historical documents are small. Such datasets are too small to train deep models that achieve good results.
To create a large dataset easily and with fewer resources, it is necessary to generate synthetic data. In this project, we researched a method for creating synthetic historical data and developed a system (a website) that lets any user synthesize documents themselves.
Our method is a deep learning method based on neural style transfer. To improve its results, we used several computer vision techniques, such as binarization, dilation, and other image-processing operations.
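As a rough illustration of the binarization and dilation steps mentioned above, here is a minimal NumPy-only sketch. It is not the project's actual preprocessing code: the function names, the fixed threshold of 128, the 3x3 structuring element, and the convention that dark ink becomes the foreground are all assumptions made for this example. In practice, OpenCV's `cv2.threshold` and `cv2.dilate` would be the usual choice.

```python
import numpy as np


def binarize(gray: np.ndarray, threshold: int = 128) -> np.ndarray:
    """Map a grayscale image to 0/255: pixels darker than `threshold` (assumed ink) become 255."""
    return np.where(gray < threshold, 255, 0).astype(np.uint8)


def dilate(binary: np.ndarray, iterations: int = 1) -> np.ndarray:
    """Thicken foreground (255) strokes using a 3x3 square structuring element."""
    out = binary.copy()
    h, w = out.shape
    for _ in range(iterations):
        # Pad with background so border pixels have a full 3x3 neighbourhood.
        padded = np.pad(out, 1, mode="constant", constant_values=0)
        # Each output pixel is the maximum over its 3x3 neighbourhood.
        neighbours = [padded[dy:dy + h, dx:dx + w] for dy in range(3) for dx in range(3)]
        out = np.max(neighbours, axis=0)
    return out


if __name__ == "__main__":
    # Hypothetical 5x5 "page": light background with one dark pixel in the centre.
    page = np.full((5, 5), 230, dtype=np.uint8)
    page[2, 2] = 20
    strokes = dilate(binarize(page))
    print(strokes)
```

Dilating the binarized content image thickens thin pen strokes, which is one plausible reason such preprocessing can help text survive the style transfer step.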
This project was built with Python, FastAPI, TensorFlow, Keras, OpenCV, Angular, Bootstrap, and other libraries.
To understand the steps we took and what we did, you are welcome to look at the Project Book.
To run this project with Docker, your environment needs to support TensorFlow Docker. You can follow this link to get everything set up.
- Clone this repository.
- Open cmd/shell/terminal and go to the application folder: `cd Hstyle/app`
- Run the docker-compose file: `docker-compose -f docker-compose-local.yml up`
- Open this link
- Enjoy the application.
- Clone this repository.
- Open the following file: `Hstyle/app/client/src/environments/environment.prod.ts`
- In that file, change the API_URL to 'http://PRODUCTION_IP_ADDRESS:5000', where PRODUCTION_IP_ADDRESS is the IP address of your deployment server.
- Open cmd/shell/terminal and go to the application folder: `cd Hstyle/app`
- Run the docker-compose file: `docker-compose -f docker-compose-prod.yml up`
- Open http://PRODUCTION_IP_ADDRESS:3000/ in your browser, where PRODUCTION_IP_ADDRESS is the IP address of your deployment server.
- Enjoy the application.
To determine which of the three techniques we considered most promising (original content image, dilated content image, binary content image) works best, we surveyed 50 participants and asked them to rate each image's readability and historical look on a scale from 1 (lowest, poor) to 5 (highest, great).
As we can see, ‘dilate content image’ and ‘binary content image’ received the most votes at ratings of three and above, meaning these results have the highest readability.
As we can see, ‘dilate content image’ received the most votes at ratings of three and above, meaning these results have the most historical look.
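The "ratings of three and above" comparison used above can be computed as a simple share of votes. The sketch below is only an illustration of that arithmetic: the function name is hypothetical, and the vote counts are made-up placeholders, not the survey's actual results.

```python
def share_at_or_above(votes: dict[int, int], cutoff: int = 3) -> float:
    """Return the fraction of votes whose rating is at least `cutoff`."""
    total = sum(votes.values())
    if total == 0:
        raise ValueError("no votes recorded")
    at_or_above = sum(count for rating, count in votes.items() if rating >= cutoff)
    return at_or_above / total


if __name__ == "__main__":
    # Placeholder counts for illustration only (rating -> number of votes).
    # These are NOT the survey's results.
    example_votes = {1: 5, 2: 10, 3: 15, 4: 12, 5: 8}
    print(share_at_or_above(example_votes))
```

Computing this share separately for each technique, once for readability and once for historical look, gives the comparison described in the two paragraphs above.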