This is a research collaboration between Luna (B. O'Meara lab, Ecology and Evolutionary Biology Department, UTK) and Moon (Computational Biology Lab, UTK).
The goal is to build a database for managing and retrieving published phylogenetic trees. To this end, several steps are needed:
- A collection of candidate publications (in PDF format) with phylogenetic tree figures.
- All figures are identified, cropped out of the original paper, and converted into images in a unified format (PNG).
- Figures with no phylogenetic trees are discarded.
- Information from tree images is extracted and stored in a computer-readable format (Newick).
- All the above information is organized into a database.
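
To make the target format concrete, here is a minimal sketch of how a tree topology could be written out as Newick. The nested-tuple representation and the function name `to_newick` are assumptions made for illustration, not part of the project code; branch lengths and internal node labels are left out.

```python
def to_newick(node):
    """Serialize a tree to a Newick string (without the trailing ';').

    Hypothetical representation: a leaf is a string (the species name),
    an internal node is a tuple or list of child nodes.
    """
    if isinstance(node, str):
        return node
    return "(" + ",".join(to_newick(child) for child in node) + ")"


# A three-species example tree.
tree = (("Homo_sapiens", "Pan_troglodytes"), "Gorilla_gorilla")
newick = to_newick(tree) + ";"
print(newick)  # ((Homo_sapiens,Pan_troglodytes),Gorilla_gorilla);
```

Storing the topology as one short string like this is what lets each tree sit in a single database field next to its source figure.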
- synthesize data
- design model (attention mechanism with simple filters)
- test accuracy on synthesized data
(0. label data manually)
- test accuracy on real data
- analyze error case by case
- add variations to the model
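
The first and third steps above could be sketched as follows. This is only one possible setup, assuming trees are held as nested tuples and that a prediction counts as correct when its topology matches the label regardless of child order; the names `random_tree`, `canonical`, and `topology_accuracy` are hypothetical. Rendering the synthetic trees to images is not covered here.

```python
import random


def random_tree(leaves, rng):
    """Build a random binary tree over `leaves` as nested tuples."""
    nodes = list(leaves)
    while len(nodes) > 1:
        # Pick two distinct subtrees at random and join them.
        i, j = sorted(rng.sample(range(len(nodes)), 2))
        right = nodes.pop(j)  # pop the larger index first
        left = nodes.pop(i)
        nodes.append((left, right))
    return nodes[0]


def to_newick(node):
    """Serialize nested tuples to Newick (without the trailing ';')."""
    if isinstance(node, str):
        return node
    return "(" + ",".join(to_newick(c) for c in node) + ")"


def canonical(node):
    """Sort children recursively so that child order does not matter."""
    if isinstance(node, str):
        return node
    return tuple(sorted((canonical(c) for c in node), key=repr))


def topology_accuracy(predicted, truth):
    """Fraction of trees whose topology matches, ignoring child order."""
    hits = sum(canonical(p) == canonical(t) for p, t in zip(predicted, truth))
    return hits / len(truth)


rng = random.Random(0)  # fixed seed so the synthetic set is reproducible
truth = [random_tree(["A", "B", "C", "D"], rng) for _ in range(5)]
print([to_newick(t) + ";" for t in truth])
```

Exact topology match is a strict metric; a softer tree distance could replace it once the error cases have been analyzed.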
An estimate of prediction accuracy on unlabeled data is needed so that we can pick out problematic pictures and manually improve the database.
- find out which features the prediction performance is sensitive to (for example, when predicting a person's age, it may be easier if the person is female rather than male; as another example, it is easier to tell Chinese from British nationality than Chinese from Japanese)
- design a specific model for the above features
- softmax and cross-entropy may be helpful to evaluate the prediction confidence
- comparing the unsupervised clustering results of the raw pictures with those of the extracted code or generated pictures might be helpful
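
As a rough illustration of the softmax idea: without labels, the cross-entropy against the true Newick string cannot be computed, but the mean negative log-probability the model assigns to its own output tokens can serve as a proxy. The sketch below assumes the decoder exposes one logit vector per output step; the function names are hypothetical, and softmax scores are not guaranteed to be well calibrated.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sequence_confidence(step_logits, chosen):
    """Score one predicted sequence.

    step_logits: one logit vector per output step (assumed model output).
    chosen: index of the token emitted at each step.
    Returns (mean negative log-likelihood, probability of the weakest step).
    """
    probs = [softmax(logits)[c] for logits, c in zip(step_logits, chosen)]
    nll = -sum(math.log(p) for p in probs) / len(probs)
    return nll, min(probs)


# Two toy predictions over a two-token vocabulary.
confident = sequence_confidence([[10.0, 0.0], [0.0, 10.0]], [0, 1])
uncertain = sequence_confidence([[0.1, 0.0], [0.0, 0.1]], [0, 1])
print(confident, uncertain)
```

Pictures could then be ranked by the first value, with the highest-scoring ones sent for manual review first.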
The deep learning method for this project is mainly based on an image captioning architecture, a hybrid of a CNN (convolutional neural network) for image feature extraction and an RNN (recurrent neural network) for language generation; here the generated language is Newick code.
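
In a captioning setup the decoder emits one token at a time, so each Newick string has to be turned into a token sequence first. The following is a minimal sketch, assuming structural symbols and species names are the tokens and that `<start>` and `<end>` markers frame each sequence; these choices and the function names are assumptions, and branch lengths are not handled.

```python
import re

# Structural symbols are single tokens; any other run of characters is a name.
TOKEN_RE = re.compile(r"[(),;]|[^(),;\s]+")


def tokenize(newick):
    """Split a Newick string into decoder tokens."""
    return ["<start>"] + TOKEN_RE.findall(newick) + ["<end>"]


def build_vocab(sequences):
    """Assign an integer id to every token, in order of first appearance."""
    vocab = {}
    for seq in sequences:
        for tok in seq:
            vocab.setdefault(tok, len(vocab))
    return vocab


tokens = tokenize("((A,B),C);")
vocab = build_vocab([tokens])
ids = [vocab[t] for t in tokens]
print(tokens)
print(ids)
```

The integer ids are what an RNN decoder would be trained to predict step by step, conditioned on the CNN features of the tree image.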
(0. same labeled data)
- list of candidate papers
- number of phylogenetic tree figures in each candidate paper
- coordinates and size of each figure (x, y, h, w)
- list of extracted images
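
One way to hold this metadata is a small record per figure. The sketch below assumes that (x, y) is the top-left corner of the figure in pixels and that h and w are its height and width; the class name, field names, and file names are hypothetical. The box returned by `crop_box` follows the (left, upper, right, lower) order that image libraries such as Pillow expect for cropping.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FigureRecord:
    paper: str  # hypothetical identifier, e.g. the PDF file name
    index: int  # position of the figure within the paper
    x: int      # assumed: left edge in pixels
    y: int      # assumed: top edge in pixels
    h: int      # height in pixels
    w: int      # width in pixels

    def crop_box(self):
        """Return (left, upper, right, lower) for cropping the figure."""
        return (self.x, self.y, self.x + self.w, self.y + self.h)


def figures_per_paper(records):
    """Count the tree figures recorded for each candidate paper."""
    counts = {}
    for rec in records:
        counts[rec.paper] = counts.get(rec.paper, 0) + 1
    return counts


records = [
    FigureRecord("paper_a.pdf", 0, x=50, y=100, h=300, w=400),
    FigureRecord("paper_a.pdf", 1, x=60, y=500, h=200, w=380),
    FigureRecord("paper_b.pdf", 0, x=10, y=20, h=150, w=150),
]
print(figures_per_paper(records))
print(records[0].crop_box())
```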
- preprocess data: convert to grayscale and extract species names with 100% accuracy
- build architecture
- design cost function
- train and fine-tune
- generate standard pictures from the predicted Newick code
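
Before a standard picture can be drawn from a predicted Newick string, the string has to be parsed, and a model may well emit malformed output. The sketch below is a minimal recursive-descent parser that rejects such strings; it assumes topology-only Newick (no branch lengths or internal node labels), and the function name `parse_newick` is hypothetical.

```python
def parse_newick(text):
    """Parse topology-only Newick into nested tuples of species names.

    Raises ValueError if the string is not well formed.
    """
    text = text.strip()
    if not text.endswith(";"):
        raise ValueError("Newick string must end with ';'")
    pos = 0

    def parse_node():
        nonlocal pos
        if text[pos] == "(":
            pos += 1
            children = [parse_node()]
            while text[pos] == ",":
                pos += 1
                children.append(parse_node())
            if text[pos] != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1
            return tuple(children)
        # Otherwise read a leaf name up to the next structural symbol.
        start = pos
        while text[pos] not in "(),;":
            pos += 1
        if start == pos:
            raise ValueError(f"expected a name at position {pos}")
        return text[start:pos]

    tree = parse_node()
    if pos != len(text) - 1:
        raise ValueError(f"unexpected text at position {pos}")
    return tree


print(parse_newick("((A,B),C);"))  # (('A', 'B'), 'C')
```

Strings that fail to parse are natural candidates for the manual review mentioned earlier, since no picture can be generated from them.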