This is a project I worked on during my internship at the National Institute of Informatics in Tokyo, Japan. Essentially, it applies classical Computer Vision to historic Japanese literature and therefore falls into the field of Digital Humanities.
Bukan are books from Japan's Edo Period (1603-1868), listing people of influence together with crests, family trees, etc. These books were bestsellers, printed using woodblocks. As a result, there is a large number of prints and editions, hiding away potentially useful information for the humanities scholar. To lessen the burden on the human researcher, a computer can help by comparing pages, showing recommendations and producing visualizations.
By utilizing proven techniques from Computer Science and Computer Vision, most notably Feature Detection and RANSAC, a database is populated with matching pages between different prints. This approach reaches an accuracy above 95% when looking for the same page in a different print of a book. Furthermore, the computed correspondence can be used to create an overlay of a page pair, resulting in a useful visualization to quickly discern the differences.
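To make the matching step concrete, here is a minimal sketch of the general idea in Python using OpenCV. The detector choice (ORB), the thresholds and the helper names are illustrative assumptions for this README, not the exact code from the notebooks:

```python
# Sketch: detect local features on two page scans, match them, and let
# RANSAC decide whether a consistent homography links the two pages.
import cv2
import numpy as np

def match_pages(path_a, path_b, min_inliers=30):
    """Return (is_match, homography, inlier_count) for two page images."""
    img_a = cv2.imread(path_a, cv2.IMREAD_GRAYSCALE)
    img_b = cv2.imread(path_b, cv2.IMREAD_GRAYSCALE)

    orb = cv2.ORB_create(nfeatures=2000)  # free alternative to SIFT/SURF
    kp_a, des_a = orb.detectAndCompute(img_a, None)
    kp_b, des_b = orb.detectAndCompute(img_b, None)
    if des_a is None or des_b is None:
        return False, None, 0

    # Brute-force Hamming matching with cross-check to reduce outliers.
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(des_a, des_b)
    if len(matches) < min_inliers:
        return False, None, 0

    pts_a = np.float32([kp_a[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    pts_b = np.float32([kp_b[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)

    # RANSAC keeps only matches that agree on a single homography.
    H, mask = cv2.findHomography(pts_a, pts_b, cv2.RANSAC, 5.0)
    inliers = int(mask.sum()) if mask is not None else 0
    return inliers >= min_inliers, H, inliers

def overlay_pages(path_a, path_b, H):
    """Warp page A onto page B and blend them for a visual comparison."""
    img_a, img_b = cv2.imread(path_a), cv2.imread(path_b)
    warped = cv2.warpPerspective(img_a, H, (img_b.shape[1], img_b.shape[0]))
    return cv2.addWeighted(warped, 0.5, img_b, 0.5, 0)
```

The reason RANSAC works well here is that random feature matches between unrelated pages rarely agree on a single homography, so a high inlier count is strong evidence that two scans show the same page.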
For more details, just have a look at my Internship Report where I also included a few graphics and examples.
I used this repository mainly for running experiments. Obviously, I was not quite sure about my approach, its performance and the usefulness of the results, which is reflected in the messy structure and the large number of Jupyter notebooks.
- `00-data-validation.ipynb` to `08-pipeline-multiprocessing.ipynb` are the Jupyter notebooks I created as I did my experiments. In general, the later the notebook, the more advanced it is.
- `annotations/` contains, as the name implies, annotations I created manually for the Shuuchin Bukan volumes. I wrote down the offset between pages so I had a ground truth to test against (see the evaluation sketch after this list).
- `helpers/` contains some Python functions I used in the notebooks.
- `report/` and `slides/` are for documentation purposes during and at the end of my internship.
- `static/` and `templates/` are for the Flask webserver `server.py`.
- `schema.sql` is for creating a SQL database to put the data into. This first database design is mediocre.
- There are also multiple `*.csv` files with metadata for the Bukan Collection.
- There are two data folders that do not exist in this repository:
  - `data/`, where the original data from the CODH website is saved by `downloadcollection.py`
  - `output/`, where processed data is saved
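As referenced above, the manually annotated page offsets serve as a ground truth. A minimal sketch of such an evaluation might look like the following; the CSV column names (`page_a`, `offset`) and the prediction format are assumptions made up for this example, not the actual layout of the files in `annotations/`:

```python
# Hypothetical evaluation against the annotated page offsets.
# Assumption: each annotation row says that page `page_a` in one print
# corresponds to page `page_a + offset` in the other print.
import pandas as pd

def accuracy_against_annotations(predictions, annotation_csv):
    """`predictions` maps a page number in print A to the predicted
    matching page number in print B."""
    truth = pd.read_csv(annotation_csv)  # assumed columns: page_a, offset
    correct = sum(
        predictions.get(row.page_a) == row.page_a + row.offset
        for row in truth.itertuples()
    )
    return correct / len(truth)
```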
First you need to download the data. Next you need to install the Python dependencies (recommended: in a virtual environment).
The data is publicly available on the servers of the Center for Open Data in the Humanities (CODH). I have prepared a small, dirty script for downloading everything (around 200 GB), so it will take some time depending on your internet connection. Just run:
python3 downloadcollection.py
Then wait a few hours and hope it does not get interrupted, since I do not catch that case (a logfile is created: `downloadcollection.log`). The data is stored in a newly created `data/` folder.
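The actual download logic lives in `downloadcollection.py`; purely as an illustration of the pattern (sequential downloads plus a logfile), a stripped-down version could look like this, with the URL being a placeholder rather than a real CODH endpoint:

```python
# Illustrative download loop, not the real downloadcollection.py.
# The URL passed to download() is a placeholder for an actual CODH link.
import logging
import requests

logging.basicConfig(filename="downloadcollection.log", level=logging.INFO)

def download(url, target):
    """Stream one file to disk and log the attempt."""
    logging.info("downloading %s", url)
    response = requests.get(url, stream=True, timeout=60)
    response.raise_for_status()
    with open(target, "wb") as f:
        for chunk in response.iter_content(chunk_size=1 << 20):
            f.write(chunk)
```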
This is optional, but I recommend creating a Virtual Environment first. There are multiple ways to do this (pipenv, conda, ...), but under Linux it basically boils down to:
python3 -m venv venv
source venv/bin/activate
Next, you can simply install all the dependencies (mostly scientific libraries) via:
pip3 install -r requirements.txt
Now you are ready to run the code, preferably by opening a Jupyter notebook:
jupyter notebook
There is also code for a simple demo application using Flask. It depends on data that is created by running the notebooks, and I am currently not sure how to release this data since there is quite a lot of it. I hope I can find resources for running the application myself somewhere. Nevertheless, this is how to start the development webserver:
export FLASK_APP=server.py
export FLASK_ENV=development # optional
flask run
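For orientation only, a route in such a Flask demo could look roughly like the sketch below; the SQLite file name, table layout and template are assumptions, not the actual contents of `server.py`:

```python
# Not the real server.py: a minimal sketch of serving precomputed matches.
# The database file, table name and template are assumed for illustration.
import sqlite3
from flask import Flask, render_template

app = Flask(__name__)

@app.route("/matches/<book_id>")
def matches(book_id):
    con = sqlite3.connect("bukan.db")
    rows = con.execute(
        "SELECT page_a, page_b, inliers FROM matches WHERE book_id = ?",
        (book_id,),
    ).fetchall()
    con.close()
    return render_template("matches.html", book_id=book_id, matches=rows)
```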