
Detection of Differences between Printed Pages and its Application on Bukan

This is a project I worked on during my internship at the National Institute of Informatics in Tokyo, Japan. Essentially, it is about applying classical Computer Vision to historical Japanese literature, thus falling under Digital Humanities.

Bukan are books from Japan's Edo Period (1603-1868), listing people of influence together with crests, family trees, etc. These books were bestsellers, printed using woodblocks. As a result, there is a large number of prints and editions, hiding away potentially useful information for the humanities scholar. To lessen the burden on the human researcher, a computer can help by comparing pages and showing recommendations and visualizations.

By utilizing proven techniques from Computer Science and Computer Vision, most notably Feature Detection and RANSAC, a database is populated with matching pages between different prints. This approach achieves an accuracy above 95% when looking for the same page in a different print of a book. Furthermore, it can be used to create a nice-looking overlay of a page pair, resulting in a useful visualization for quickly discerning the differences.
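
The Internship Report describes the actual pipeline in detail. Purely as an illustration of the general idea, and not as the method from the notebooks, here is a minimal sketch of such a page comparison with OpenCV (ORB and all parameter choices below are mine, standing in for whatever the notebooks actually use):

import cv2
import numpy as np

def match_pages(path_a, path_b, min_matches=10):
    img_a = cv2.imread(path_a, cv2.IMREAD_GRAYSCALE)
    img_b = cv2.imread(path_b, cv2.IMREAD_GRAYSCALE)

    # Detect and describe local features; ORB is a patent-free choice.
    orb = cv2.ORB_create(nfeatures=2000)
    kp_a, des_a = orb.detectAndCompute(img_a, None)
    kp_b, des_b = orb.detectAndCompute(img_b, None)

    # Brute-force matching with cross-checking keeps only mutual best matches.
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(des_a, des_b)
    if len(matches) < min_matches:
        return None  # probably not the same page

    # RANSAC estimates a homography while discarding outlier matches;
    # the inlier count serves as a score for how well the pages match.
    src = np.float32([kp_a[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp_b[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    homography, mask = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    if homography is None:
        return None

    # Warp page A onto page B and blend both, yielding the kind of
    # overlay visualization mentioned above.
    overlay = cv2.warpPerspective(img_a, homography, (img_b.shape[1], img_b.shape[0]))
    overlay = cv2.addWeighted(overlay, 0.5, img_b, 0.5, 0)
    return homography, int(mask.sum()), overlay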

For more details, have a look at my Internship Report, where I also included a few graphics and examples.

Structure of this Repository

I used this repository mainly for running experiments. Admittedly, I was not quite sure about my approach, its performance, and the usefulness of the results, which is reflected in the messy structure and the large number of Jupyter notebooks.

  • 00-data-validation.ipynb to 08-pipeline-multiprocessing.ipynb are Jupyter notebooks I created as I did my experiments. In general, the later the notebook, the more advanced it is.
  • annotations/ contains, as the name implies, annotations I created manually for the Shuuchin Bukan volumes. I wrote down the offsets between pages so I had a ground truth to test against.
  • helpers/ contains some Python functions I used in the notebooks.
  • report/ and slides/ are for documentation purposes during and at the end of my internship.
  • static/ and templates/ are for the Flask webserver server.py.
  • schema.sql is for creating a SQL database to put the data into. This first database design is mediocre. (A hypothetical sketch of the idea is shown after this list.)
  • There are also multiple *.csv files with metadata for the Bukan Collection.
  • There are two data folders that do not exist in this repository:
    • data/ where the original data from the CODH website is saved by downloadcollection.py
    • output/ where processed data is saved
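
The actual tables are defined in schema.sql. Purely as a hypothetical sketch of the underlying idea (storing matched page pairs between different prints), with table and column names that are made up rather than taken from schema.sql, it could look as follows using Python's sqlite3:

import sqlite3

conn = sqlite3.connect("bukan.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS book (
    id    INTEGER PRIMARY KEY,
    title TEXT NOT NULL            -- e.g. one Shuuchin Bukan volume
);
CREATE TABLE IF NOT EXISTS page_match (
    book_a  INTEGER REFERENCES book(id),
    page_a  INTEGER,               -- page number in print A
    book_b  INTEGER REFERENCES book(id),
    page_b  INTEGER,               -- matching page number in print B
    inliers INTEGER                -- RANSAC inlier count as match quality
);
""")
conn.commit()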

Dependencies

First you need to download the data. Next you need to install the Python dependencies (recommended: in a virtual environment).

Downloading the Dataset: Bukan Collection

The data is publicly available on the servers of the Center for Open Data in the Humanities (CODH). I have prepared a small, dirty script for downloading everything (around 200 GB), so it will take some time depending on your internet connection. Just run:

python3 downloadcollection.py

Then wait for some hours and pray the download does not get interrupted, since I do not catch this case (a logfile is created, though: downloadcollection.log). The data is stored in a newly created data folder.
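
For reference, this is roughly the shape of such a logging download loop. Everything below is a sketch: the URL is a placeholder rather than the real CODH endpoint, and skipping already-present files is my addition, not something downloadcollection.py actually does:

import logging
import urllib.request
from pathlib import Path

logging.basicConfig(filename="downloadcollection.log", level=logging.INFO)

def download(url, target_dir="data"):
    Path(target_dir).mkdir(exist_ok=True)
    target = Path(target_dir) / url.rsplit("/", 1)[-1]
    if target.exists():
        logging.info("skipping %s, already present", target)
        return
    logging.info("downloading %s", url)
    urllib.request.urlretrieve(url, target)

# download("https://example.org/bukan/volume-001.zip")  # placeholder URL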

Installing Python Dependencies

This is optional, but I recommend creating a Virtual Environment first. There are multiple ways to do this (pipenv, conda, …), but under Linux it basically comes down to:

python3 -m venv venv
source venv/bin/activate

Next, you can simply install all the dependencies (mostly scientific libraries) via:

pip3 install -r requirements.txt

Now you are ready to run the code, preferably by opening a Jupyter notebook:

jupyter notebook

Webserver

There is also code for a simple demo application using Flask. It depends on data that is created by running the notebooks, and I am currently not sure how to release this data since it is quite large. I hope I can find resources for running the application myself somewhere. Nevertheless, this is the code for starting the development webserver:

export FLASK_APP=server.py
export FLASK_ENV=development  # optional
flask run
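
The actual application is implemented in server.py; purely as a hypothetical stub of what such an app looks like (the route and response below are made up, not taken from server.py):

from flask import Flask

app = Flask(__name__)

@app.route("/")
def index():
    # The real app reads the processed data and serves overlays via
    # static/ and templates/; this stub only shows the general shape.
    return "Page-pair overlays would be served here."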
