
Detection of Differences between Printed Pages and its Application on Bukan

This is a project I worked on during my internship at the National Institute of Informatics in Tokyo, Japan. Essentially, it is about applying classical Computer Vision to historical Japanese literature, thus falling under Digital Humanities.

Bukan are books from Japan's Edo Period (1603-1868), listing people of influence together with crests, family trees, etc. These books were bestsellers, printed using woodblocks. As a result, there is a large number of prints and editions, hiding away potentially useful information for the humanities scholar. To lessen the burden on the human researcher, a computer can help by comparing pages and showing recommendations and visualizations.

By utilizing proven techniques from Computer Science and Computer Vision, most notably Feature Detection and RANSAC, a database is populated with matching pages between different prints. This approach achieves an accuracy above 95% when looking for the same page in a different print of a book. Furthermore, it can be used to create a nice-looking overlay of a page pair, resulting in a useful visualization for quickly discerning the differences.
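
The Internship Report describes the actual pipeline in detail. Purely as an illustration of the general idea, and not as the method from the notebooks, here is a minimal sketch of such a page comparison with OpenCV (ORB and all parameter choices below are mine, standing in for whatever the notebooks actually use):

import cv2
import numpy as np

def match_pages(path_a, path_b, min_matches=10):
    img_a = cv2.imread(path_a, cv2.IMREAD_GRAYSCALE)
    img_b = cv2.imread(path_b, cv2.IMREAD_GRAYSCALE)

    # Detect and describe local features; ORB is a patent-free choice.
    orb = cv2.ORB_create(nfeatures=2000)
    kp_a, des_a = orb.detectAndCompute(img_a, None)
    kp_b, des_b = orb.detectAndCompute(img_b, None)

    # Brute-force matching with cross-checking keeps only mutual best matches.
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(des_a, des_b)
    if len(matches) < min_matches:
        return None  # probably not the same page

    # RANSAC estimates a homography while discarding outlier matches;
    # the inlier count serves as a score for how well the pages match.
    src = np.float32([kp_a[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp_b[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    homography, mask = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    if homography is None:
        return None

    # Warp page A onto page B and blend both, yielding the kind of
    # overlay visualization mentioned above.
    overlay = cv2.warpPerspective(img_a, homography, (img_b.shape[1], img_b.shape[0]))
    overlay = cv2.addWeighted(overlay, 0.5, img_b, 0.5, 0)
    return homography, int(mask.sum()), overlay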

For more details, have a look at my Internship Report, where I also included a few graphics and examples.

Structure of this Repository

I used this repository mainly for running experiments. Admittedly, I was not quite sure about my approach, its performance, and the usefulness of the results, which is reflected in the messy structure and the large number of Jupyter notebooks.

  • 00-data-validation.ipynb to 08-pipeline-multiprocessing.ipynb are Jupyter notebooks I created as I did my experiments. In general, the later the notebook, the more advanced it is.
  • annotations/ contains, as the name implies, annotations I created manually for the Shuuchin Bukan volumes. I wrote down the offsets between pages so I had a ground truth to test against.
  • helpers/ contains some Python functions I used in the notebooks.
  • report/ and slides/ are for documentation purposes during and at the end of my internship.
  • static/ and templates/ are for the Flask webserver server.py.
  • schema.sql is for creating a SQL database to put the data into. This first database design is mediocre. (A hypothetical sketch of the idea is shown after this list.)
  • There are also multiple *.csv files with metadata for the Bukan Collection.
  • There are two data folders that do not exist in this repository:
    • data/ where the original data from the CODH website is saved by downloadcollection.py
    • output/ where processed data is saved
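
The actual tables are defined in schema.sql. Purely as a hypothetical sketch of the underlying idea (storing matched page pairs between different prints), with table and column names that are made up rather than taken from schema.sql, it could look as follows using Python's sqlite3:

import sqlite3

conn = sqlite3.connect("bukan.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS book (
    id    INTEGER PRIMARY KEY,
    title TEXT NOT NULL            -- e.g. one Shuuchin Bukan volume
);
CREATE TABLE IF NOT EXISTS page_match (
    book_a  INTEGER REFERENCES book(id),
    page_a  INTEGER,               -- page number in print A
    book_b  INTEGER REFERENCES book(id),
    page_b  INTEGER,               -- matching page number in print B
    inliers INTEGER                -- RANSAC inlier count as match quality
);
""")
conn.commit()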

Dependencies

First you need to download the data. Next you need to install the Python dependencies (recommended: in a virtual environment).

Downloading the Dataset: Bukan Collection

The data is publicly available on the servers of the Center for Open Data in the Humanities (CODH). I have prepared a small, dirty script for downloading everything (around 200 GB), so it will take some time depending on your internet connection. Just run:

python3 downloadcollection.py

Then wait for some hours and pray the download does not get interrupted, since I do not catch this case (a logfile is created, though: downloadcollection.log). The data is stored in a newly created data folder.
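
For reference, this is roughly the shape of such a logging download loop. Everything below is a sketch: the URL is a placeholder rather than the real CODH endpoint, and skipping already-present files is my addition, not something downloadcollection.py actually does:

import logging
import urllib.request
from pathlib import Path

logging.basicConfig(filename="downloadcollection.log", level=logging.INFO)

def download(url, target_dir="data"):
    Path(target_dir).mkdir(exist_ok=True)
    target = Path(target_dir) / url.rsplit("/", 1)[-1]
    if target.exists():
        logging.info("skipping %s, already present", target)
        return
    logging.info("downloading %s", url)
    urllib.request.urlretrieve(url, target)

# download("https://example.org/bukan/volume-001.zip")  # placeholder URL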

Installing Python Dependencies

This is optional, but I recommend creating a Virtual Environment first. There are multiple ways to do this (pipenv, conda, …), but under Linux it basically comes down to:

python3 -m venv venv
source venv/bin/activate

Next, you can simply install all the dependencies (mostly scientific libraries) via:

pip3 install -r requirements.txt

Now you are ready to run the code, preferably by opening a Jupyter notebook:

jupyter notebook

Webserver

There is also code for a simple demo application using Flask. It depends on data that is created by running the notebooks, and I am currently not sure how to release this data since it is quite large. I hope I can find resources for running the application myself somewhere. Nevertheless, this is the code for starting the development webserver:

export FLASK_APP=server.py
export FLASK_ENV=development  # optional
flask run
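
The actual application is implemented in server.py; purely as a hypothetical stub of what such an app looks like (the route and response below are made up, not taken from server.py):

from flask import Flask

app = Flask(__name__)

@app.route("/")
def index():
    # The real app reads the processed data and serves overlays via
    # static/ and templates/; this stub only shows the general shape.
    return "Page-pair overlays would be served here."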
