Pic2NewickTree

This is a research collaboration between Luna (B. O'Meara lab, Ecology and Evolutionary Biology Department, UTK) and Moon (Computational Biology Lab, UTK).

Final product

The goal is a database for managing and retrieving published phylogenetic trees. Building it takes several steps:

1. A collection of candidate publications (in PDF format) with phylogenetic tree figures is assembled.
2.1 All figures are identified, cropped out of the original paper, and converted into pictures in a unified format (PNG).
2.2 Figures with no phylogenetic trees are discarded.
3. Information from tree images is extracted and stored in a computer-readable format (Newick).
4. All the above information is organized into a database.
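
For readers unfamiliar with Newick, the sketch below shows how a small tree maps to a Newick string. The `Node` class and `to_newick` function are illustrative names, not code from this repository, and the sketch ignores branch lengths and quoted labels.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A tree node; leaves carry a species name, internal nodes may be unnamed."""
    name: str = ""
    children: List["Node"] = field(default_factory=list)


def to_newick(node: Node) -> str:
    """Serialize a subtree to Newick, without the trailing semicolon."""
    if not node.children:
        return node.name
    inner = ",".join(to_newick(child) for child in node.children)
    return f"({inner}){node.name}"


# A three-leaf tree: A is sister to the clade (B, C).
tree = Node(children=[
    Node("A"),
    Node(children=[Node("B"), Node("C")]),
])
newick = to_newick(tree) + ";"
print(newick)  # (A,(B,C));
```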

Approach A: Manually-engineered steps for pictures with standard presentations

design phase

synthesize data
design model (attention mechanism with simple filters)
test accuracy on synthesized data
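
One possible way to synthesize labels is to generate random tree topologies as Newick strings and then render each one to a picture. The sketch below covers only the first half; `random_newick` and the `sp0`, `sp1`, ... labels are hypothetical, and the rendering step is left out.

```python
import random


def random_newick(leaves, rng):
    """Build a random binary topology by repeatedly joining two subtrees."""
    subtrees = list(leaves)
    while len(subtrees) > 1:
        a = subtrees.pop(rng.randrange(len(subtrees)))
        b = subtrees.pop(rng.randrange(len(subtrees)))
        subtrees.append(f"({a},{b})")
    return subtrees[0] + ";"


rng = random.Random(0)  # fixed seed so the synthetic set is reproducible
labels = [f"sp{i}" for i in range(5)]
samples = [random_newick(labels, rng) for _ in range(3)]
for sample in samples:
    print(sample)
```

Because the generator knows the true Newick string for every synthetic picture, accuracy on synthesized data can be measured without manual labeling.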

test phase

(0. label data manually)

test accuracy on real data
analyze error case by case
add variations to the model

evaluate prediction confidence

Prediction accuracy must also be estimated for unlabeled data, so that we can pick out problematic pictures and correct their database entries manually.

find out which features prediction performance is sensitive to (for example, when predicting a person's age, it may be easier if the person is female rather than male; similarly, it is easier to tell Chinese from British nationality than Chinese from Japanese)
design a specific model for the above features
softmax and cross-entropy may be helpful to evaluate the prediction confidence
comparing unsupervised clustering results for the raw pictures with those for the extracted code or the regenerated pictures might be helpful
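
As a rough illustration of the softmax idea, the sketch below scores a prediction by the mean log-probability of the most likely token at each decoding step. It assumes the model emits one vector of logits per output token, which the plan above does not specify; `sequence_confidence` is a hypothetical helper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sequence_confidence(step_logits):
    """Mean log-probability of the top token per step (closer to 0 is more confident)."""
    log_probs = [math.log(max(softmax(logits))) for logits in step_logits]
    return sum(log_probs) / len(log_probs)


confident = [[5.0, 0.0, 0.0], [0.0, 6.0, 0.0]]   # one token clearly dominates
uncertain = [[1.0, 0.9, 1.1], [0.5, 0.4, 0.6]]   # scores are nearly tied
print(sequence_confidence(confident) > sequence_confidence(uncertain))  # True
```

Pictures whose score falls below a chosen threshold could then be queued for manual review; the threshold itself would have to be calibrated on labeled data.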

Approach B: Deep learning model with standard presentations

The deep learning method for this project is based mainly on an image-captioning architecture: a hybrid of a CNN (convolutional neural network) for image feature extraction and an RNN (recurrent neural network) for generating language, which here is Newick code.
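
In a captioning setup the RNN emits one token at a time, so the Newick target has to be turned into a token sequence first. The sketch below is one plausible tokenization, not the project's actual one: `tokenize`, `build_vocab`, and the `<start>`/`<end>` markers are assumptions, and a label with a branch length such as `A:0.1` would stay a single token.

```python
import re

# Structural symbols become single tokens; any other run of characters is a label.
TOKEN_RE = re.compile(r"[(),;]|[^(),;\s]+")


def tokenize(newick):
    """Split a Newick string into structural symbols and labels."""
    return TOKEN_RE.findall(newick)


def build_vocab(sequences):
    """Map each token to an integer id, reserving 0 and 1 for start/end markers."""
    vocab = {"<start>": 0, "<end>": 1}
    for seq in sequences:
        for tok in seq:
            vocab.setdefault(tok, len(vocab))
    return vocab


tokens = tokenize("(A,(B,C));")
vocab = build_vocab([tokens])
ids = [vocab["<start>"]] + [vocab[t] for t in tokens] + [vocab["<end>"]]
print(tokens)  # ['(', 'A', ',', '(', 'B', ',', 'C', ')', ')', ';']
print(ids)     # [0, 2, 3, 4, 2, 5, 4, 6, 7, 7, 8, 1]
```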

acquire data

(0. same labeled data)

list of candidate papers
number of phylogenetic tree figures in each candidate paper
coordinates and size of each figure (x, y, h, w)
list of extracted images
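
The items above could be stored as one record per extracted figure. In the sketch below the `FigureRecord` name, the file names, and the reading of (x, y) as the top-left corner of the bounding box are all assumptions; the number of figures per paper is derived by counting records.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class FigureRecord:
    paper: str        # hypothetical identifier of the source PDF
    x: int            # left edge of the bounding box (assumed top-left origin)
    y: int            # top edge of the bounding box
    h: int            # height of the figure
    w: int            # width of the figure
    image_path: str   # path of the cropped PNG

    def box(self):
        """Return (left, top, right, bottom), a common form for crop routines."""
        return (self.x, self.y, self.x + self.w, self.y + self.h)


records = [
    FigureRecord("paper_001.pdf", 50, 120, 300, 400, "paper_001_fig1.png"),
    FigureRecord("paper_001.pdf", 60, 500, 200, 380, "paper_001_fig2.png"),
    FigureRecord("paper_002.pdf", 40, 90, 350, 420, "paper_002_fig1.png"),
]
figures_per_paper = Counter(r.paper for r in records)
print(records[0].box())         # (50, 120, 450, 420)
print(dict(figures_per_paper))  # {'paper_001.pdf': 2, 'paper_002.pdf': 1}
```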

model construction

1. preprocess data: convert to grayscale and extract species names with 100% accuracy
2.1 build architecture
2.2 design cost function
3. train and fine-tune
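
For the grayscale step, a minimal sketch using the standard ITU-R BT.601 luminance weights is shown below. It works on plain nested lists so that it stays self-contained; in practice an imaging library would do this conversion, and `to_grayscale` is an illustrative name.

```python
def to_grayscale(pixels):
    """Convert rows of (R, G, B) tuples in 0..255 to rows of luminance values."""
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
        for row in pixels
    ]


# A 2x2 test image: white, black, pure red, pure blue.
image = [
    [(255, 255, 255), (0, 0, 0)],
    [(255, 0, 0), (0, 0, 255)],
]
gray = to_grayscale(image)
print(gray)  # [[255, 0], [76, 29]]
```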

presentation

generate standard pictures from the predicted Newick code
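
To illustrate going from Newick back to a drawing, the sketch below parses a simple Newick string and prints a text rendering. The text output only stands in for a real image renderer; `parse_newick` and `render` are hypothetical helpers that assume no branch lengths or quoted labels.

```python
def parse_newick(text):
    """Parse a simple Newick string into nested (name, children) tuples."""
    pos = 0

    def subtree():
        nonlocal pos
        children = []
        if text[pos] == "(":
            pos += 1  # consume "("
            children.append(subtree())
            while text[pos] == ",":
                pos += 1  # consume ","
                children.append(subtree())
            pos += 1  # consume ")"
        start = pos
        while pos < len(text) and text[pos] not in "(),;":
            pos += 1
        return (text[start:pos], children)

    return subtree()


def render(node, prefix="", is_last=True, is_root=True):
    """Return the tree as indented text lines, one node per line."""
    name, children = node
    label = name or "+"  # unnamed internal nodes are drawn as "+"
    if is_root:
        lines = [label]
        child_prefix = ""
    else:
        lines = [prefix + ("`-- " if is_last else "|-- ") + label]
        child_prefix = prefix + ("    " if is_last else "|   ")
    for i, child in enumerate(children):
        lines += render(child, child_prefix, i == len(children) - 1, False)
    return lines


tree = parse_newick("(A,(B,C));")
print("\n".join(render(tree)))
```

Rendering every prediction in the same layout makes it easier to compare the predicted tree with the original figure side by side.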
