This repository contains:
- A web crawler (based on the Scrapy Python library) to download pictures of aircraft from various websites.
- An OpenIMAJ project to train an image classifier with traditional machine learning techniques (along with utility classes to process the output of the crawler).
- A (more advanced) Theano script to train a Convolutional Neural Network (CNN) image classifier.
- A minimalist Python web server to host the CNN classifier.
You will need to download and install Python 2.7, or preferably a scientific Python distribution, such as Anaconda.
Also required is the Java 1.8 Development Kit.
Install Scrapy:
pip install scrapy
Run the crawler (cd into planespotter/scrapy):
scrapy crawl airliners -o planes.json > log.txt
Note: This will download potentially millions of (small) pictures to your hard drive, which takes a long time. Running the crawl on an SSD will greatly speed up the process.
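Once the crawl finishes, it can help to sanity-check the feed before moving on. Scrapy's `-o planes.json` option writes the scraped items as a single JSON array; the sketch below loads that array and reports how many items it holds and which field names appear. The field names depend on the spider and are not assumed here, and `summarize_feed` is a hypothetical helper, not part of this repository.

```python
import io
import json
from collections import Counter


def summarize_feed(path):
    """Return (item_count, Counter of field names) for a Scrapy JSON feed."""
    # A feed written with `-o planes.json` is one JSON array of items.
    with io.open(path, encoding="utf-8") as feed:
        items = json.load(feed)
    fields = Counter()
    for item in items:
        # Count in how many items each field name occurs.
        fields.update(item.keys())
    return len(items), fields
```

Calling `summarize_feed("planes.json")` from the folder where the crawl ran returns the item count and the per-field counts. Note that the whole file is loaded into memory, which may be slow for a very large crawl.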
In Eclipse, import the openimaj_classifier folder as an "Existing Project into Workspace".
Note: OpenIMAJ is built with Maven. The project declares a large number of (probably unused) dependencies, so Maven will download many libraries from the Internet for the first build. Later builds reuse the downloaded copies.
There are three main classes in the project that you can run:
- tk.thebrightstuff.JsonProcessor: This utility class takes as input one (or more) JSON files created by the crawler, and reformats them into a single text file (required by both OpenIMAJ and Theano).
- tk.thebrightstuff.Sorter: This utility class processes an image folder created by the crawler (or a tar version of it) to create a more file-system-efficient folder structure (required by both OpenIMAJ and Theano).
- tk.thebrightstuff.AircraftApp: This class trains the image annotator. Various inputs are required, such as the path where the image folder is stored on disk and how many pictures should be used for training. After training, all the data is saved in a data.txt file (which can be reloaded later), and the classifier is tested against a set of pictures.
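To illustrate what a "file-system-efficient folder structure" can look like, here is a minimal Python sketch of one common approach: spreading a flat folder of many small files over nested subfolders named after a hash of each file name. This is an assumption for illustration only; it is not necessarily the layout that the Sorter class produces, and `shard_path` and `shard_folder` are hypothetical names.

```python
import hashlib
import os
import shutil


def shard_path(root, filename, levels=2, width=2):
    """Build a nested destination path from a hash of the file name."""
    digest = hashlib.md5(filename.encode("utf-8")).hexdigest()
    # e.g. digest "ab12..." gives the subfolders "ab" and "12".
    parts = [digest[i * width:(i + 1) * width] for i in range(levels)]
    return os.path.join(root, *(parts + [filename]))


def shard_folder(src, dst):
    """Copy each file of the flat folder `src` into a sharded tree under `dst`."""
    copied = 0
    for name in os.listdir(src):
        source = os.path.join(src, name)
        if not os.path.isfile(source):
            continue  # skip subfolders
        target = shard_path(dst, name)
        target_dir = os.path.dirname(target)
        if not os.path.isdir(target_dir):
            os.makedirs(target_dir)
        shutil.copy2(source, target)
        copied += 1
    return copied
```

With two levels of two hexadecimal characters, files are spread over at most 65,536 leaf folders, which keeps each directory small even for millions of pictures.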
Install Theano:
pip install theano
To train the CNN on your GPU (much more efficient), you also need a good NVIDIA graphics card, and you must install CUDA and g++. On Windows you will probably need to install Visual Studio (see this post for an example setup).
Depending on your setup, you will need to customize the .theanorc file (in your home folder). An example is provided below (for Windows):
[global]
device = gpu
floatX = float32
exception_verbosity = high
compute_test_value = raise
[cuda]
root = C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v7.5
[nvcc]
flags = --use-local-env --cl-version=2013 -LC:\Users\niluje\Anaconda\Lib;
compiler_bindir=C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\bin\amd64
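Before starting a long training run, it can be worth confirming that the file is where you expect it and contains the values you intended. The sketch below only reads the file with Python's standard INI parser and compares the `[global]` section against the example values above; it does not import Theano or verify that the GPU actually works. `read_theanorc` and `check_gpu_settings` are hypothetical helpers, not part of this repository.

```python
import os

try:
    from configparser import RawConfigParser  # Python 3
except ImportError:
    from ConfigParser import RawConfigParser  # Python 2.7


def read_theanorc(path=None):
    """Return the settings of a .theanorc file as {section: {option: value}}."""
    if path is None:
        path = os.path.join(os.path.expanduser("~"), ".theanorc")
    parser = RawConfigParser()
    parser.optionxform = str  # keep option names case-sensitive (floatX)
    parser.read(path)  # yields no sections if the file is missing
    return {s: dict(parser.items(s)) for s in parser.sections()}


def check_gpu_settings(settings):
    """List differences from the example [global] values shown above."""
    problems = []
    global_opts = settings.get("global", {})
    if not global_opts.get("device", "").startswith("gpu"):
        problems.append("[global] device is not set to a gpu device")
    if global_opts.get("floatX") != "float32":
        problems.append("[global] floatX is not float32")
    return problems
```

`check_gpu_settings(read_theanorc())` returns an empty list when the file in your home folder matches the example, and a list of messages otherwise.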
Run the script (cd into theano_conv_net):
python Theano_aircraft.py
Note: The CNN will take a very long time to train, depending on your hardware, the size of the dataset and other settings that you can tune in the script.
The Theano script should save a model-values.save file inside webapp/results. You are ready to run the server!
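If you are unsure whether training finished and wrote its output, a small check like the following can confirm that the model file is in place before the server is started. It is a sketch that only tests for a non-empty file at the location named above; it does not validate the file's contents, and `find_model` is a hypothetical helper.

```python
import os


def find_model(webapp_dir="webapp", name="model-values.save"):
    """Return the path of the saved model under <webapp_dir>/results, or None."""
    path = os.path.join(webapp_dir, "results", name)
    # Treat a missing or empty file as "no model available".
    if os.path.isfile(path) and os.path.getsize(path) > 0:
        return path
    return None
```

Run from the repository root, `find_model()` returns the path to the file when it exists and `None` otherwise.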
Run the server (cd into webapp):
python server.py
Note: Depending on your setup you may need to run the server as an administrator.
Visit the web application at localhost.
Note: If you want to run the server on a separate computer, you will need to install Python and Theano on that computer as well.