A deep-learning-based tool for consensus polishing.
Roko is a consensus polisher that takes a draft assembly and reads aligned to it in BAM format, and outputs a set of contigs in FASTA format. It uses a deep learning architecture to produce a high-quality consensus. Features are represented as reads sampled within a window, and labels are mapped to the draft assembly in Medaka-style fashion.
Requirements:
- Check HTSlib dependencies.
- gcc 5.0+ and g++
- Python 3.6 or 3.7 (with python3-dev and venv)
To install with GPU support:
git clone https://github.com/lbcb-sci/roko.git roko
cd roko
make gpu
To install the CPU-only version:
git clone https://github.com/lbcb-sci/roko.git roko
cd roko
make cpu
To activate the virtual environment:
. $PROJECT_DIR/roko/bin/activate
To generate features for model training or inference:
python features.py [options ...] <ref> <X> <o>
<ref>
Draft sequence in FASTA format
<X>
Reads aligned to <ref> in BAM format
<o>
Output name (e.g. output.hdf5)
options:
--Y
Truth genome aligned to <ref> in BAM format (training only)
--t
default: 1
Number of worker processes
To generate the BAM files used for feature generation, the mini_align script from pomoxis is recommended.
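The two ways of calling features.py (with `--Y` for training, without it for inference) can be sketched as a small command builder. This is only an illustration: `features_command` is a hypothetical helper that is not part of Roko, and all file names are placeholders. It uses only the positional arguments and options documented above.

```python
import shlex


def features_command(ref, reads_bam, out, truth_bam=None, workers=1):
    """Build the argument list for features.py (hypothetical helper)."""
    cmd = ["python", "features.py"]
    if truth_bam is not None:
        # --Y is only needed when generating labeled features for training.
        cmd += ["--Y", truth_bam]
    # --t sets the number of worker processes (documented default: 1).
    cmd += ["--t", str(workers), ref, reads_bam, out]
    return cmd


# Placeholder file names, chosen for illustration only.
train_cmd = features_command("draft.fasta", "reads.bam", "train.hdf5",
                             truth_bam="truth.bam", workers=4)
infer_cmd = features_command("draft.fasta", "reads.bam", "infer.hdf5")

for cmd in (train_cmd, infer_cmd):
    # Print a shell-ready command line without executing it.
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

The printed lines can be pasted into a shell from the roko directory with the virtual environment active, or the lists can be passed to `subprocess.run(cmd, check=True)`.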
To train a model:
python train.py [options ...] <train> <out>
<train>
Directory containing generated .hdf5 files used for training (or one .hdf5 file)
<out>
Directory for saving trained model
options:
--val
Directory containing generated .hdf5 files used for validation (or one .hdf5 file)
--b
default: 128
Batch size used for training and validation
--memory
default: False
If the flag is present, training and validation data are stored in RAM
--t
default: 0
Number of workers for the training and validation data loaders (each loader uses --t workers)
To run inference:
python inference.py [options ...] <data> <model> <out>
<data>
Path to the generated features (.hdf5 file)
<model>
Path to the saved model in .pth format
<out>
Path to the output file (FASTA format)
options:
--t
default: 0
Number of workers for inference
--b
default: 128
Inference batch size
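Polishing a draft with an already trained model chains two of the steps above: feature generation without `--Y`, then inference. The sketch below only assembles the two command lines and prints them. `polish_commands` is a hypothetical helper, not part of Roko, and the file names are placeholders.

```python
import shlex


def polish_commands(draft, reads_bam, model, out_fasta,
                    features="features.hdf5", workers=0, batch_size=128):
    """Return the two commands that polish `draft` with a trained model.

    Hypothetical helper: it only builds the command lines documented
    above and does not run them.
    """
    # Step 1: features for inference (no --Y, since no truth is available).
    make_features = ["python", "features.py", draft, reads_bam, features]
    # Step 2: inference on the generated features with the saved model.
    run_inference = ["python", "inference.py",
                     "--t", str(workers), "--b", str(batch_size),
                     features, model, out_fasta]
    return [make_features, run_inference]


# Placeholder file names, chosen for illustration only.
steps = polish_commands("draft.fasta", "reads.bam", "model.pth", "polished.fasta")
for step in steps:
    print(" ".join(shlex.quote(arg) for arg in step))
```

To actually execute the steps in order, each list can be passed to `subprocess.run(step, check=True)`, which stops at the first failing command.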
The model was trained and tested on FASTQ basecalls from Zymo R10 Native “3 Peaks”. Data was binned using Loman's script. Draft assemblies were generated using raven. The BAM files used for feature generation and for labeling were generated with the mini_align script from pomoxis.
Organisms used for training are: B. subtilis, E. faecalis, E. coli, L. monocytogenes and S. enterica. P. aeruginosa was used for validation. Models were tested on S. aureus. Results were evaluated using the pomoxis assess_assembly script.
The (mean) results are given in the following table:
Model | Total error | Mismatch | Deletion | Insertion | Qscore |
---|---|---|---|---|---|
Raven | 0.160% | 0.040% | 0.059% | 0.061% | 27.97 |
Medaka | 0.037% | 0.012% | 0.007% | 0.017% | 34.30 |
HELEN | 0.066% | 0.019% | 0.031% | 0.016% | 31.78 |
Roko | 0.035% | 0.013% | 0.008% | 0.013% | 34.55 |
Total error does not correspond to the sum of errors because of rounding.
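The rounding note can be checked directly from the table. The sketch below sums the three error components per model and also converts the total error to a Phred-scaled value, Q = -10 * log10(error). That the Qscore column is computed this way is an assumption, not something stated above. Small differences are expected in any case, since the table reports mean values.

```python
import math

# Rows copied from the table above:
# (total error, mismatch, deletion, insertion), all in percent.
results = {
    "Raven":  (0.160, 0.040, 0.059, 0.061),
    "Medaka": (0.037, 0.012, 0.007, 0.017),
    "HELEN":  (0.066, 0.019, 0.031, 0.016),
    "Roko":   (0.035, 0.013, 0.008, 0.013),
}


def phred(error_percent):
    """Phred-scaled quality for an error rate given in percent (assumed formula)."""
    return -10 * math.log10(error_percent / 100)


summary = {}
for model, (total, mismatch, deletion, insertion) in results.items():
    component_sum = round(mismatch + deletion + insertion, 3)
    summary[model] = (component_sum, round(phred(total), 2))
    print(f"{model}: components sum to {component_sum:.3f}%, "
          f"reported total {total:.3f}%, Q from total {summary[model][1]:.2f}")
```

For Medaka and Roko the components sum to 0.036% and 0.034%, each 0.001 percentage points below the reported total, which is the rounding effect mentioned above. The Phred values computed from the totals land within about 0.03 of the Qscore column.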
The model used in the comparison above (R10, Guppy 2.3.8) can be downloaded here.
This tool is still in an early development stage. All bugs and questions can be reported to: dominik.stanojevic@fer.hr, mile.sikic@fer.hr or mile_sikic@gis.a-star.edu.sg.