Skip to content

bartongroup/yanocomp

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

79 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

yanocomp

(yet another nanopore modification comparison tool)

DOI

yanocomp does detection of RNA modifications from nanopore direct RNA sequencing (DRS) data which has been "eventaligned" using nanopolish. It uses a comparative approach with General mixture models similar to those used by nanocompore and xpore. The main selling points of yanocomp are:

  • It parses nanopolish eventalign output on the fly into a (relatively) compact HDF5 file allowing random access.
  • A GTF file can be provided to convert transcriptomic coordinates from eventalign back to genomic coordinates.
  • It uses the fast pomegranate library for model fitting.
  • It fits models using multiple adjacent kmers to better separate modified and unmodified distributions.
  • A uniform distribution is used to model outliers caused by low quality signal or poor alignment. This improves the fit and reduces false positives.
  • The trained models can be used to make single molecule modification predictions for each read at each significant position.

For the comparative method, you will need nanopore DRS data from a control sample with normal levels of modifications and a treatment sample with altered levels of modifications. NB: yanocomp is still in quite a beta stage so there are likely to be bugs!

Installation:

yanocomp has been tested with python 3.8.5, and requires numpy, pandas, h5py, pomegranate, scipy, statsmodels, and click. The easiest way to install it is using the conda environment yaml provided:

git clone https://github.com/bartongroup/yanocomp.git
cd yanocomp
conda env create -f yanocomp.yml
conda activate yanocomp

Alternatively yanocomp and the required packages can be installed using pip:

pip install git+git://github.com/bartongroup/yanocomp.git

Usage:

Before running yanocomp, you should align your nanopore DRS data to a transcriptome reference (with no spliced alignment, no secondary alignments), for example using minimap2. You should then align the signal level data using nanopolish with the command:

nanopolish eventalign \
  --scale-events \
  --signal-index \
  --print-read-names \
  -r <fastq-fn> \
  -b <bam-fn> \
  -g <reference-fasta-fn>

yanocomp itself has two main commands:

prep:

yanocomp prep will parse a text or gzipped file containing the tabular results from nanopolish eventalign and summarise them into a more compact HDF5 file. Output from eventalign can also be piped to yanocomp prep on the fly. Basecalled reads should be aligned to a transcriptome reference before event alignment, but yanocomp prep can convert back to genomic coordinates if a GTF file is provided.

Options:

$ yanocomp prep --help
Usage: yanocomp prep [OPTIONS]

  Parse nanopolish eventalign tabular data into a more managable HDF5
  format...

Options:
  -e, --eventalign-fn TEXT  File containing output of nanopolish eventalign.
                            Can be gzipped. Use - to read from stdin
                            [default: -]
  -h, --hdf5-fn TEXT        Output HDF5 file  [required]
  -g, --gtf-fn TEXT         Optional GTF file which can be used to convert
                            transcriptomic coordinates to genomic
  -p, --processes INTEGER   [default: 1]
  --help                    Show this message and exit.

Example usage:

$ nanopolish eventalign -t 28 \
  --scale-events --signal-index --print-read-names \
  -r reads.fastq \
  -b aligned.bam \
  -g transcriptome.fasta |
diffmod prep -p 4 \
  -g annotation.gtf \
  -h output.hdf5

Output:

A HDF5 file storing summarised event level signal data. The file contains one group per gene (if a GTF is provided) or transcript (if no GTF is provided).

gmmtest:

yanocomp gmmtest compares samples from two conditions to identify positions with differences in the profile of aligned nanopore signal "events". Positions where there are detectable differences in signal (detected with Kolmogorov Smirnov test, default p<0.05 and KS statistic > 0.1) will progress to model fitting. This is done by fitting a two-gaussian general mixture model (plus a third uniform distribution for outliers) using pomegranate and then estimating the fraction of modified and unmodified reads in each sample. These estimates are used to perform a G-test. Fitted models can be optionally used to make single-molecule modification predictions.

Options:

$ yanocomp gmmtest --help
Usage: yanocomp gmmtest -c <cntrl_hdf5_1> \
                        -c <cntrl_hdf5_2> \
                        -t <treat_hdf5_1> \
                        -t <treat_hdf5_2> \
                        [OPTIONS]

  Differential RNA modifications using nanopore DRS signal level data

Options:
  -c, --cntrl-hdf5-fns TEXT       Control HDF5 files. Can specify multiple
                                  files using multiple -c flags  [required]
  -t, --treat-hdf5-fns TEXT       Treatment HDF5 files. Can specify multiple
                                  files using multiple -t flags  [required]
  -o, --output-bed-fn TEXT        Output bed file name  [required]
  -s, --output-sm-preds-fn TEXT   JSON file to output single molecule
                                  predictions. Can be gzipped (detected from
                                  name)
  -m, --model-fn TEXT             Model file with expected kmer current
                                  distributions
  --test-level [gene|transcript]  Test at transcript level or aggregate to
                                  gene level  [default: gene]
  -w, --window-size INTEGER       How many adjacent kmers to model over
                                  [default: 5]
  -e, --outlier-factor FLOAT      Scaling factor for labelling outliers during
                                  model initialisation. Smaller means more
                                  aggressive labelling of outliers  [default:
                                  0.5]
  -n, --min-read-depth INTEGER    Minimum reads per replicate to test a
                                  position. Default is to set dynamically
  -d, --max-fit-depth INTEGER     Maximum number of reads per replicate used
                                  to fit the model  [default: 1000]
  -k, --min-ks FLOAT              Minimum KS test statistic to attempt to
                                  build a model for a position  [default: 0.2]
  -f, --fdr-threshold FLOAT       False discovery rate threshold for output
                                  [default: 0.05]
  -p, --processes INTEGER RANGE   [default: 1;x>=1]
  --help                          Show this message and exit.

Example usage:

$ yanocomp gmmtest -p 12 \
  -c cntrl_1.hdf5 -c cntrl_2.hdf5 -c cntrl_3.hdf5 \
  -t treat_1.hdf5 -t treat_1.hdf5 -t treat_3.hdf5 \
  -o output.bed -s output_sm_preds.json.gz

Output:

A 19-column BED file format with the following values:

1. chrom [string]
2. start [integer]
3. end [integer]
4. gene_id:kmer [string]
5. score (min(- log10(FDR), 100)) [integer]
6. strand [string, either '+' or '-']
7. modification log ratio (plus 95% conf intervals) [str]
8. G-test p-value [float]
9. G-test FDR [float]
10. Control sample estimated fraction modified [float]
11. Treatment sample estimated fraction modified [float]
12. G-statistic (Treat vs cntrl) [float]
13. Homogeneity G-statistic [float]
14. Unmodified distribution mean for central kmer [float]
15. Unmodified distribution standard deviation [float]
16. Modified distribution mean for central kmer [float]
17. Modified distribution standard deviation [float]
18. KS statistic [float]

gmmtest can also optionally output a json file containing the single molecule modification predictions. The schema may change but is currently:

{
    "input_fns": {
        "cntrl": {
            "$rep": "<file_name>"
        },
        "treat": {
            "$rep": "<file_name>"
        }
    }
    "single_molecule_predictions": {
        "$gene_id": {
            "$pos": {
                "kmers": [...], # array of strings, shape (window_size,)
                "model": {
                    "unmod": {
                        "mu": [...], # array of floats, shape (window_size,)
                        "cov": [...], # array of floats, shape (window_size, window_size)
                    },
                    "mod": {
                        "mu": [...],
                        "cov": [...],
                    },
                    weights: [...] # array of floats, shape (2,)
                },
                "cntrl": {
                    "$rep": {
                        "read_ids": [...], # array of strings, shape (nreads,)
                        "events": [...], # array of floats, shape (nreads, window_size)
                        "preds": [...], # array of floats, shape (nreads,)
                    } 
                },
                "treat": {
                    "$rep": {
                        "read_ids": [...],
                        "events": [...],
                        "preds": [...],
                    }
                }
            }
        }
    }
}

Citing yanocomp:

A preprint describing yanocomp is available on bioRxiv

Parker,M.T. et al. (2021) 
Yanocomp: robust prediction of m6A modifications in individual nanopore direct RNA reads.
bioRxiv, 2021.06.15.448494.

About

Yet another nanopore modification comparison tool

Resources

License

Stars

Watchers

Forks

Packages

No packages published