PoreOver is a basecalling tool for the Oxford Nanopore sequencing platform and is primarily intended for the task of consensus decoding raw basecaller probabilities for higher accuracy 1D2 sequencing. PoreOver includes a standalone RNN basecaller (PoreOverNet) that can be used to generate these probabilities, though the highest consensus accuracy is achieved in combination with Bonito, one of ONT's research basecallers.
More generally, PoreOver can serve as a platform on which to explore new decoding algorithms and basecalling architectures.
If you find it useful, please cite:
Silvestre-Ryan J, Holmes I. Pair consensus decoding improves accuracy of neural network basecallers for nanopore sequencing. Genome biology. 2021 Dec;22(1):1-6.
- Python 3
- TensorFlow 2
git clone https://github.com/jordisr/poreover
cd poreover
pip install -r requirements.txt
pip install -e .
Now the software can be run with:
poreover --help
PoreOver has four main modes:
- call: Run a forward pass of PoreOver's neural network and save the probabilities
- decode: Decode the output of PoreOver or another CTC basecaller using Viterbi or beam search
- pair-decode: Generate consensus sequences given a list of paired read probabilities
- train: Train a new neural network using PoreOver
PoreOver includes a simple basecalling network with an architecture inspired by other community basecallers such as DeepNano (Boža et al. 2017) and Chiron (Teng et al. 2018). It uses a single convolutional layer followed by three bidirectional GRU layers, and is trained with CTC loss. It is not as accurate as Bonito and is intended mostly for testing.
poreover call data/read.fast5
This will run the forward pass of the network and save the logits output to read.npy. This file can then be passed to the decode command.
The call command will also take a directory as input, e.g.
poreover call data/reads
A nucleotide sequence can then be decoded to FASTA format using decode. The basecaller must be specified to correctly parse the file.
poreover decode read.npy --basecaller poreover
By default, this does Viterbi (i.e. best path) decoding, though alternatively --algorithm beam will perform a beam search, with the beam width configurable with --beam_width.
While beam search may outperform Viterbi decoding, in our experience any improvement is not usually worth the increased computational cost.
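To make the default concrete, here is a minimal sketch of CTC best-path decoding in plain Python. It is not PoreOver's implementation: the label order (A, C, G, T, then a final blank) and the toy score matrix are assumptions made for illustration. Because only the per-timestep argmax matters, the same logic applies to logits or to probabilities.

```python
# Sketch of CTC best-path (Viterbi) decoding; not PoreOver's actual code.
# Assumption: each row holds scores for A, C, G, T and a final blank label.
ALPHABET = "ACGT"
BLANK = len(ALPHABET)  # assumed index of the CTC blank


def best_path_decode(scores):
    """Take the argmax label at each timestep, then collapse the path."""
    sequence = []
    previous = BLANK
    for row in scores:
        label = max(range(len(row)), key=row.__getitem__)
        # Emit a base only when the label is not blank and differs
        # from the label at the previous timestep.
        if label != BLANK and label != previous:
            sequence.append(ALPHABET[label])
        previous = label
    return "".join(sequence)


# Hypothetical scores for five timesteps.
toy_scores = [
    [0.9, 0.0, 0.0, 0.0, 0.1],  # A
    [0.8, 0.0, 0.0, 0.0, 0.2],  # A repeated, collapsed into the first
    [0.1, 0.0, 0.0, 0.0, 0.9],  # blank
    [0.7, 0.1, 0.1, 0.0, 0.1],  # A again, kept because a blank intervened
    [0.0, 0.1, 0.8, 0.0, 0.1],  # G
]
print(best_path_decode(toy_scores))  # AAG
```

Best-path decoding is a single pass over the timesteps, whereas beam search keeps several candidate prefixes alive at every step, which is where the extra computational cost comes from.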
Pair decoding can be run on a single pair of reads, as in
poreover pair-decode data/reads/read1.npy data/reads/read2.npy --reverse_complement --basecaller poreover
It can also be run on a list of read pairs, provided the probabilities have already been generated with a basecaller.
poreover pair-decode data/pairs.txt --reverse_complement --basecaller poreover
Each line in the pairs file must specify the paths to two reads' neural network output. As a convenience, if the listed file ends with .fast5 (as in the example) poreover will look for the appropriate .npy file.
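The lookup described above can be sketched as follows. This is an illustration only, not PoreOver's parser: the helper names read_pairs and to_npy are hypothetical, and the sketch assumes two whitespace-separated paths per line and that the .npy file sits next to the .fast5 file under the same base name.

```python
# Sketch of reading a pairs file; the format PoreOver accepts is assumed here.
from pathlib import Path


def to_npy(path):
    """Swap a .fast5 suffix for .npy; leave any other path unchanged."""
    return path.with_suffix(".npy") if path.suffix == ".fast5" else path


def read_pairs(lines):
    """Yield (first, second) paths from the lines of a pairs file.

    Assumption: each non-empty line holds two whitespace-separated paths.
    """
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 2:
            raise ValueError(f"expected two paths, got: {line!r}")
        yield to_npy(Path(fields[0])), to_npy(Path(fields[1]))


# Hypothetical pairs file contents.
example = ["data/reads/read1.fast5 data/reads/read2.fast5", ""]
pairs = list(read_pairs(example))
print(pairs)
```

Checking that both files of every pair exist before starting a long run is a cheap way to catch a missing basecalling step early.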
As Bonito does not currently support saving the basecaller probabilities, a slight modification must be made to allow this. This can be done using the bonito022.patch file (needs Bonito version 0.2.2).
git clone https://github.com/nanoporetech/bonito --branch v0.2.2
cd bonito
git apply ../poreover/data/bonito022.patch # substitute path to poreover repo
pip install -r requirements.txt
pip install -e .
Now, running Bonito will generate a .npy file for each read, named by the read ID.
Flip-flop is a CTC variation developed by ONT and implemented in their production basecaller Guppy, as well as the research basecaller Flappie. Both of these basecallers can optionally output a trace of probabilities using the --fast5-out (in Guppy) or the --trace (Flappie) option.
PoreOver can read and decode these probabilities, yielding a basecalled sequence.
poreover decode data/flappie_trace.hdf5
poreover decode data/guppy_flipflop.fast5
Note: Beam search decoding (and by extension pair decoding) does not seem to perform well on the flip-flop model and so is not recommended.
It is possible to use one of the architectures available in PoreOver to train a new basecalling model.
poreover train --data data/training.npz --loss_every 5 --epochs 10
This will use the default architecture of 1 convolutional + 3 bidirectional GRU layers, though there are a few other named architectures listed in poreover/network/network.py. Checkpoints as well as the final parameters will be written to a directory named with [model architecture]_[run name]_[start date]_[start time]. This directory can be passed to the call command (as shown below).
Once there is a model to load, we can make a basecall on a sample read (of course, after only a little training on a toy dataset we would not expect it to be very accurate).
poreover call data/read.fast5 --weights $RUN_DIRECTORY --model $RUN_DIRECTORY/model.json