This code is associated with the following manuscript. If you use any part of the source code, please cite us:
Sumaiya Nazeen, Yun William Yu, and Bonnie Berger*.
"Carnelian uncovers hidden functional patterns across diverse study populations from whole metagenome sequencing reads." (Accepted for publication in Genome Biology).
A preliminary version of the paper was presented at ECCB 2018 (Applications Track), and the bioRxiv preprint can be found at https://doi.org/10.1101/375121
Upon publication, further information can be found at http://carnelian.csail.mit.edu/
0. Requirements
Vowpal Wabbit 8.1.1
scikit-learn
R 3.3.2
Python 2.7.13
BioPython 1.70
FragGeneScan
This code has been tested with GCC 6.3.0 on Ubuntu 17.04, running
under Bash 4.4.7(1) on a server with an Intel Xeon E5-2695 v2 x86_64 2.40 GHz
processor and 320 GB of RAM.
Using the EC-2010-DB dataset as the gold standard, Carnelian can comfortably be
run on a machine with 16 GB of RAM using a single CPU.
1. Directory structure
data/: EC-2010-DB dataset with gold standard EC labels.
scripts/: R scripts for abundance estimation and analysis from read counts in functional bins.
util/
    ext/: external libraries.
    test/: tests for drawfrag.c and fasta2skm.c.
    drawfrag.c: draws fragments from fasta records.
    fasta2skm.c: constructs features (spaced k-mer profiles) and converts them to VW input format.
    ldpc.py: generates LSH functions using LDPC codes.
    sequtil.py: splitting and merging utilities for fasta files.
    merge_pairs.py: links paired-end read files using paired-end relationships.
    reduce.py: translates sequences using reduced amino acid alphabets.
    kseq.h: parses FASTA files.
tests/
    demo_data/: data files required for unit and advanced tests.
    basictest_carnelian.py: contains the unit tests for Carnelian.
    advancedtest_carnelian.py: contains the end-to-end tests for Carnelian.
    config.py: configures unit tests and advanced tests for Carnelian.
    README.txt: contains the instructions to run the tests.
2. Install and test:
bash SETUP.sh
3. Usage:
Modes:
(Default values of --optional-arguments such as k-mer length, fragment size,
number of hash functions, etc. are set to work best with EC-2010-DB as used in the manuscript.
If you are going to train on a different dataset, be sure to tune these parameters.)
1) ./carnelian.py frag [--optional-arguments] test_dir frag_dir [-h]
Looks for a fasta file in test_dir with a matching label file.
Randomly draws fragments of the length and coverage specified in
the optional arguments (use "./carnelian.py frag -h" for details).
Outputs these fragments with corresponding labels into frag_dir.
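For example, a minimal invocation using the default fragmentation parameters (the directory names are illustrative):
./carnelian.py frag my_test_dir my_frag_dir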
2) ./carnelian.py train [--optional-arguments] train_dir model_dir [-h]
Looks for a fasta file in train_dir with a matching label file.
For each training batch, it randomly draws fragments, generates
feature vectors using Opal LDPC hashes, and trains a Vowpal Wabbit
one-against-all classifier over all batches sequentially. To train
classifiers in precise mode, use the "--precise" option, which will make
the learned model store probabilities.
Outputs the generated classifier model into model_dir.
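For example, to train a precise-mode model (directory names are illustrative):
./carnelian.py train --precise my_train_dir my_model_dir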
3) ./carnelian.py retrain [--optional-arguments] old_model_dir new_model_dir new_exmpls_dir [-h]
Looks for a vowpal-wabbit model with pattern and dictionary files
in old_model_dir and a fasta file with matching labels in
new_exmpls_dir. Starting with the old model, it updates the existing
training model and merges the new labels with the old dictionary using the old
LDPC patterns. Note that a model trained in default mode must be updated
in default mode; the same holds for precise mode.
The output model, dictionary, and pattern files will be generated in new_model_dir.
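For example (directory names are illustrative):
./carnelian.py retrain my_old_model_dir my_new_model_dir my_new_examples_dir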
4) ./carnelian.py translate [--optional-arguments] seq_dir out_dir fgsp_loc [-h]
Using the FragGeneScan program located in the fgsp_loc directory, tries to find
coding sequences in the input reads fasta file in seq_dir, translates the
coding sequences into possible ORFs, and outputs them as a fasta file in out_dir.
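For example, assuming FragGeneScan is installed under util/ext/FragGeneScan (this path and the directory names are illustrative):
./carnelian.py translate my_reads_dir my_orfs_dir util/ext/FragGeneScan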
5) ./carnelian.py predict [--optional-arguments] model_dir test_dir predict_dir [-h]
Looks for a classifier model in model_dir and a fasta file in
test_dir containing reads/fragments. To make predictions with probabilities,
run in precise mode using the "--precise" option and specify the probability cutoff
using the "--cutoff <X>" option.
Outputs the predictions in predict_dir as a fasta file with
a corresponding label file.
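For example, a precise-mode prediction with a probability cutoff of 0.5 (the cutoff value and directory names are illustrative):
./carnelian.py predict --precise --cutoff 0.5 my_model_dir my_test_dir my_predict_dir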
6) ./carnelian.py eval reference_file predicted_labels [-h]
Evaluates prediction accuracy in terms of micro- and macro-averaged
precision, sensitivity, and F1-score. If run in "precise" mode, it
assumes the predicted_labels file has two tab-separated columns: <readID, predLabel>.
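For example (file names are illustrative):
./carnelian.py eval reference_labels.txt predicted_labels.txt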
7) ./carnelian.py abundance in_dir out_dir mapping_file gs_file [-h]
Generates abundance estimates of functional terms. Looks for the predicted labels for
each sample in its own sub-directory of in_dir, and for sample mapping information
and average protein length per label in mapping_file and gs_file, respectively. Please
note that the sample IDs must not start with digits.
Outputs raw-count and effective-count matrices in out_dir.
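For example (file and directory names are illustrative):
./carnelian.py abundance my_predictions_dir my_abundance_dir sample_mapping.txt avg_protein_lengths.txt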
8) ./carnelian.py simulate [--optional-arguments] test_dir train_dir out_dir [-h]
Runs a full pipeline for performance evaluation: trains on data
in train_dir, tests on data in test_dir, and outputs fragments, the model, and
predictions under out_dir in the following directory structure:
1frag/: simulated test data (drawn fragments) are saved here
(ignored if --do-not-fragment).
2model/: the classifier will be saved here.
3predict/: fragment classifications are saved here.
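For example (directory names are illustrative):
./carnelian.py simulate my_test_dir my_train_dir my_simulation_out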
9) ./carnelian.py annotate [--optional-arguments] sample_dir model_dir out_dir fgsp_loc [-h]
Annotates the input nucleotide reads end to end: performs gene finding and translation
on the reads fasta file in sample_dir using FragGeneScan located in the fgsp_loc
directory, then classifies the predicted ORFs using the model in model_dir, and
outputs the labels in out_dir.
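For example, an end-to-end annotation of a single sample, again assuming FragGeneScan is installed under util/ext/FragGeneScan (all paths are illustrative):
./carnelian.py annotate my_sample_dir my_model_dir my_annotation_dir util/ext/FragGeneScan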
The steps to follow in a typical workflow are given in the workflow.txt file.
To replicate our classification performance analysis, the code in performance_analysis.R can be used. Before running
the script, the following R packages need to be installed:
caret, pROC, ROCR, cvAUC, randomForest
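One way to install them from the command line (assuming Rscript is on your PATH):
Rscript -e 'install.packages(c("caret","pROC","ROCR","cvAUC","randomForest"), repos="https://cran.r-project.org")'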
Contact
Sumaiya Nazeen, nazeen@mit.edu
Acknowledgement
This implementation of Carnelian is adapted from the source code of the following papers:
Yunan Luo, Y. William Yu, Jianyang Zeng, Bonnie Berger, and Jian Peng. Metagenomic binning through low density hashing. Bioinformatics (2018), bty611, https://doi.org/10.1093/bioinformatics/bty611
K. Vervier, P. Mahe, M. Tournoud, J.-B. Veyrieras, and J.-P. Vert. Large-scale Machine Learning for Metagenomics Sequence Classification, Technical report HAL-01151453, May 2015.