This repository contains the PAcbio Prediction of Epigenetics Technology (Papet) software to model the PacBio CCS kinetic signal and predict epigenetics modifications from this kinetic signal.
- Content
- Dependencies
- Compilation
- PacBio kinetics and epigenetics
- Running Papet
5.1. model-kinetic
5.2. model-kinetic-txt
5.3. kinetics
5.4. kinetics-wig
5.5. kinetics-kmer
5.6. model-sequence
5.7. model-sequence-txt
5.8. loglikelihoods
5.9. predict - [Aknow]
Papet relies on the following third party libraries:
- pthread, normally a system library on linux system.
- boost v1.78 or higher
- pbcopper v2.1.0 or higher
- pbbam v2.0.0 (pbbam relies on pbcopper)
- ngsaipp v1.6.1
We strongly recommend to install these libraries in /usr/local/lib
and their corresponding header files in /usr/local/include
. If done this way, cmake compilation configuration should work out of the box.
The building process uses cmake. To compile and install papet, simply type:
./build.sh
The CMakeLists.txt
file contains 3 variable defined at its top:
# user defined paths
## list of directories containing libraries headers
set(INCLUDE_DIRECTORIES "/usr/local/include/"
"/usr/local/include/ngsaipp")
## list of directories in which the required libraries are installed
set(LINK_DIRECTORY "/usr/local/lib")
## path in which the final executable will be installed
set(INSTALL_DIRECTORY "/usr/local/bin")
-
INCLUDE_DIRECTORIES
a space separated list of directories in which the required third party library header files are located. -
LINK_DIRECTORY
a space separated list of directories in which all the required libraries are installed. -
INSTALL_DIRECTORY
the directory in which papet will be installed.
PacBio BAM file epignetics signal data handling, modelling as well as the epigenetics prediction computations are performed by ngsaipp. For an in-depth dive on this topic, please read the dedicated documention
Papet is a wrapper application that can run different commands to fullfil specific tasks. The general synthax is:
papet [command] [options]
Papet accepts the following options and commands:
short | long | description |
---|---|---|
-h | --help | Produces the help message |
--vaudois | Tastes quite good. | |
model-kinetic | Creates kinetic signal models from CCSs. | |
model-kinetic-txt | Dumps a kinetic signal model in txt format. | |
kinetics | Extracts CCS kinetic information in txt format. | |
kinetics-wig | Creates WIG tracks from CCSs. | |
kinetics-kmer | Computes the per-kmer distribution of kinetic signal from CCSs. | |
model-sequence | Creates DNA sequence kinetic signal models from CCSs. | |
model-sequence-txt | Dumps a DNA sequence kinetic signal model in txt format. | |
predict | Predicts the presence of epignetic modifications from CCSs. |
model-kinetic trains a kinetic signal model by computing the kinetic (IPD and PWD) signal distribution at each position inside windows of <N> bases. The signal distributions are learned by extracting the kinetic signal, from the reads, using <N> bases long windows centered on the given CpGs. The model is then serialized in the given file. The final results is a serialized instance derived from the KineticModel class (see ngsaipp KineticModel.hpp).
The synthax is:
papet model-kinetic [type] [options]
The exact type of kinetic signal model is defined using the first argument. The accepted values are:
raw
: simply computes the distribution of IPD and PWD at each position in the window. The model is a serialized instance of the RawKineticModel class (see ngsaipp RawKineticModel.hpp).raw-norm
: computes the normalized IPD and PWD' distributions at each position in the window. The model is a serialized instance of the NormalizedKineticModel class (see ngsaipp NormalizedKineticModel.hpp).diposition
: computes the distributions of IPD and PWD signal at each position in the window as a function of its direct neighbour. The model is a serialized instance of the DiPositionKineticModel class (see ngsaipp DiPositionKineticModel.hpp).diposition-norm
: computes the distributions of normalized IPD and PWD signal at each position in the window as a function of its direct neighbor. The model is a serialized instance of the DiPositionNormalizedKineticModel class (see ngsaipp DiPositionNormalizedKineticModel.hpp).pairwise
: computes all distributions of IPD and PWD as a function of the signal at another position in the window. The model is a serialized instance of the PairWiseKineticModel class (see ngsaipp PairWiseKineticModel.hpp)pairwise-norm
: computes all distributions of normalized IPD and PWD as a function of the signal at another position in the window. The model is a serialized instance of the PairWiseNormalizedKineticModel class (see ngsaipp PairWiseNormalizedKineticModel.hpp)
This program has the following options :
short | long | description |
---|---|---|
-h | --help | Produces the help message |
--bam | A coma separated list of paths to the bam files containing the mapped PacBio CCS of interest. | |
--bed | The path to the bed file containing the genomic coordinates of the CpGs of interest. | |
--out | The path to file in which the kinetic model will be saved. | |
--background | For normalized models only, the path to background model to use. It must contain a serialized KmerMap. | |
--size | The size of the model, the length of the signal window to model, in bp | |
--nbin | The number of bins in each histogram. | |
--xmin | The lower limit of the lower bin in each histogram. | |
--xmax | The upper limit of the upper bin in each histogram. | |
--pseudocount | A number of counts that will be added to each bin in each histogram, by default 0. | |
--thread | The number of threads, by default 1. |
model-kinetic-txt converts a kinetic signal model file in tsv format and prints the results on stdout. If the model is a normalized one, the KmerMap is not tincluded in the tsv conversion.
The synthax is:
papet model-kinetic-txt [options] [> FILE]
This program has the following options :
short | long | description |
---|---|---|
-h | --help | Produces the help message |
--model | The path to the file containing the kinetic model to convert in tsv format. |
kinetics is an application to extract interpulse duration (IPDs) and pulse widths (PWs) from mapped PacBio CCS reads that overlap a given set of genomic regions specified in a BED 6 file. The results are printed on stdout in tsv format. The first row is a header. Then, each line contains per read sequence, IPDs and PWDs. The memory footprint of this progam is O(n), where n is the window size. The results are immediately printed and nothing is kept in memory.
The synthax is:
papet kinetics [options] [> FILE]
This program has the following options :
short | long | description |
---|---|---|
-h | --help | Produces the help message |
--bam arg | A coma separated list of paths to the bam files containing the mapped PacBio CCS of interest. | |
--bed arg | The path to the bed file containing the genomic regions of interest. | |
--winSize | The size in bp of the windows from which the CCS features will be extracted. It must be odd. The window will be centered on the center of the genomic regions of interst. |
kinetics-wig is an application that creates wig kinetic tracks from a set of PacBio CCS from from bam files, over a set of defined genomic windows. Exactly 4 tracks are created each time, corresponding to the IPD on forward strand, the IPD on the reverse strand, the PWD on the forward strand and the PWD on the reverse strand.
The synthax is:
papet kinetics-wig [options]
This program has the following options :
short | long | description |
---|---|---|
-h | --help | Produces the help message |
--bam arg | A coma separated list of paths to the bam files containing the mapped PacBio CCS from which the IPD/PWD track must be computed. | |
--bed | The path to the bed file containing the genomic oordinates of the CpGs of interest. | |
--out | A path prefix to use to write the results files. In total, 4 resulting files will be created with this prefix : _IPDfw.wig, _IPDrv.wig, _PWDfw.wig and _PWDrv.wig containing the IPD and PWD forward and reverse track respectively. | |
--winSize | The size of the window (in bp) around the CpGs in which the average kinetic signal will be computed. |
kinetics-kmer is an application that computes the interpulse duration (IPDs) and pulse widths (PWs) value distribution for all kmers found in a PacBio CCS dataset. The kmer IPD / PWD value is defined has the IPD / PWD value corresponding to the central position of the kmer. The results are printed on stdout in tsv format. The first row is a header. Then, each line contains the results for one kmer. Each line contains the kmer sequence (column 1), the IPD distribution for this kmer (columns 2 to 954) amd the PWD distribution (columns 955 1907). IPD / PWD values can take any score in the [0,952] interval. Thus there are 953 columns for each IPD / PWD distribution. The 1st IPD / PWD related value (column 2 / 955) corresponds to the number of times an IPD / PWD value of 0 was observed for this kmer. The 2nd IPD / PWD value (column 3 / 956) corresponds to the number of time an IPD / PWD value of 1 was observed for this kmer, and so on until a score of 952 (column 953 / 1907). In the final table, not all possible kmers will be present in the table, only the ones found in the data. The memory footprint of this progam is O(4^k) where k is the size of the kmers.
The synthax is:
papet kinetics-kmer [options]
This program has the following options :
short | long | description |
---|---|---|
-h | --help | Produces the help message |
--bam arg | A coma separated list of paths to the bam files containing the PacBio CCS of interest. | |
--kmer arg | The size of the kmer. It must be odd. |
model-sequence is an application that computes the mean IPD and PWD per position signal for all possible kmer from a set of CCS reads (see Expected kinetic background). The final results is a
serialized instance of KmerMap (see ngsaipp KmerMap.hpp). The serialized instance can then either be loaded in another program or converted in tsv format using model-sequence-txt
.
The memory footprint of this progam is O(4^k) where
k is the size of the kmers. Note that overflow errors may occure if a given kmer is encountered too many times in the dataset. This can be the
case if there are too many reads (unlikely) or if the kmer size is too small (experience shows that kmer of 3 leads to this issue). In any case, if the mean signal seems weird, this is a likely explanation.
The synthax is:
papet model-sequence [options]
This program has the following options :
short | long | description |
---|---|---|
-h | --help | Produces the help message |
--bam arg | A coma separated list of paths to the bam files containing the PacBio CCS of interest. | |
--out arg | The path to the file in which the KmerMap will be dumped. | |
--kmer arg | The size of the kmer. It must be odd. |
model-sequence-txt is an application that converts into tsv format a serialized KmerMap created by model-sequence
. The results are printed on stdout. The first row contains headers, the following rows contains the kmer data. Each row contains the kmer sequence, the kmer number of occurences, k
per position IPD means and k
per position PWD means.
The memory footprint of this progam is O(4^k) where k is the size of the kmers. Note that overflow errors may occure if a given kmer is encountered too many times in the dataset. This can be the case if there are too many reads (unlikely) or if the kmer size is too small (experience shows that kmer of 3 leads to this issue). In any case, if the mean signal seems weird, this is a likely explanation.
The synthax is:
papet model-sequence-txt [options] [>FILE]
This program has the following options :
short | long | description |
---|---|---|
-h | --help | Produces the help message |
--model arg | The path to the file containing the KmerMap to convert in tsv format. |
reports the detailled contributions of each strand of each CCSs of a model loglikelihood computations. The output is streamed on stdout in tab-separated format. Each row contains the contributions of a single CCS for a given CpG strand. The first row is a header.
The program is given a set of CpG coordinates of interest, a set of CCS a model the kinetic signal likelihood from. One row in the results contains the individual contributions of a given CCS for a given CpG on a given strand (fw or rv). The contributions are composed of the IPD contributions followed by the PWD contributions. The contributions may corresponds to individual positions, for simple models, or to pairs of positions for pairwise or diposition models. The contributions are always given in a 5'->3' fashion from left to right columns.
Contributions from a given CCS,for both strands of a given CpG are reported as 2 rows. One for the fw strand and the second for the rv strand.
The synthax is:
papet loglikelihoods [options] [>FILE]
This program has the following options :
short | long | description |
---|---|---|
-h | --help | Produces the help message |
--bam | A coma separated list of paths to the bam files containing the mapped PacBio CCS of interest. | |
--bed | The path to a bed file containing the coordinates of the CpGs interest. | |
--model | The path to the file containing the kinetic model of intrest. |
predict predicts the presence of epigenetic modifications at the CpG of interest given two models and returns the results on stdout in BED format. The fields are :
- CpG chromosome
- CpG starting position (C)
- CpG ending position (past G)
- name (empty)
- CpG strand
- CpG methylation probability
- (optional, enabled with
--ids
) a coma separated list of CCSs names used to compute the methylation probability
The score field contains the probability of the presence of an epigenetic modification, the
The synthax is:
papet predict [options] [>FILE]
This program has the following options :
short | long | description |
---|---|---|
-h | --help | Produces the help message |
--bam | A coma separated list of paths to the bam files containing the mapped PacBio CCS of interest. | |
--bed | The path to a bed file containing the coordinates of the CpGs interest. | |
--modelMeth | The path to the file containing the methylated kinetic model to use. | |
--modelUnmeth | The path to the file containing the unmethylated kinetic model to use. | |
--prob | The prior probability of methylation for any CpG. It must belong to [0,1]. 0.5 by default. | |
--thread | The number of threads, by default 1. |
This program requires that the input bam and bed files have
been ordered by chromosome and position using
samtools sort
and sort -k1,1V -k2,2n -k3,3n
respectively!
Richard Hall and Pacific Biosciences for the discussions and the shared data.
- Romain Groux