Here we provide a brief walkthrough to run mAFiA, using the example of chromosome X.
- The following steps are tested with python verion 3.9. Get code and activate virtual environment, e.g.:
git clone https://github.com/dieterich-lab/mAFiA.git
cd mAFiA
python3 -m venv mafia-venv
source mafia-venv/bin/activate
If you pip version is <21.3, then upgrade it to a newer version:
python3 -m pip install --upgrade pip
Install package
pip install -e .
- Download models and data from here
- The folder "models" contains:
- backbone.torch: RODAN-based neural network for basecalling and feature extraction
- backbone.config: training configuration for backbone
- classifiers: pickled logistic regression models
- The folder "data" contains a subset of input data on chr X:
- fast5_chrX: dRNA-Seq raw data from HEK293 WT mRNA
- GRCh38_96.X: genome reference
- GLORI_chrX.bed: query modification sites in bed format. This file specifically corresponds to those listed in GLORI.
- The folder "models" contains:
- Assume that data and model are unzipped to ${data} and ${model} respectively. Your output directory is ${output}
backbone="${models}/backbone.torch"
classifiers="${models}/classifiers"
fast5dir="${data}/fast5_chrX"
ref="${data}/GRCh38_96.X.fa"
mod="${data}/GLORI_chrX.bed"
mkdir -p "${output}"
basecall="${output}/rodan.fasta"
bam="${output}/minimap.q50.bam"
The basecalling script is adapted from the RODAN repository. Assume that ${mafia} is your code directory.
python3 ${mafia}/RODAN/basecall.py \
--fast5dir ${fast5dir} \
--model ${backbone} \
--batchsize 4096 \
--outdir ${output}
On a reasonably modern GPU machine, this step should take less than 30 mins.
Align basecalling results to reference genome. Filter, sort, and index BAM file.
minimap2 --secondary=no -ax splice -uf -k14 -t 36 --cs ${ref} ${basecall} \
| samtools view -bST ${ref} -q50 - \
| samtools sort - > ${bam}
samtools index ${bam}
After the standard procedures, we can now measure m6A stoichiometry of the sites specified in ${mod}.
test_mAFiA \
--bam_file ${bam} \
--fast5_dir ${fast5dir} \
--ref_file ${ref} \
--mod_file ${mod} \
--min_coverage 50 \
--max_num_reads 1000 \
--backbone_model_path ${backbone} \
--classifier_model_dir ${classifiers} \
--mod_prob_thresh 0.5 \
--out_dir ${output}
The last step should take less than 1 hour. We are currently working on integrating the feature extraction step directly into basecalling. The run-time should then be significantly reduced.
In your ${output} directory, you should now see two files:
- "mAFiA.sites.bed": List of sites with coverage above minimum threshold (default 50). The column "modRatio" lists the site's stoichiometry.
- "mAFiA.reads.bam": Aligned reads identical to those in the input BAM file ${bam}, but with additional MM and ML tags that mark the location and modification probability in each individual read. The results can be visualized with, eg, IGV.
The complete HEK293 WT dataset for all chromosomes can be downloaded from zenodo.