RNA editing is a widespread phenomenon that drives transcriptomic diversity and influences cellular differentiation and function. Despite its importance, existing tools are inefficient for comprehensive analysis of RNA editing in single-cell datasets. We introduce MARINE (Multi-Core Algorithm for Rapid Identification of Nucleotide Edits), a powerful and efficient tool for mapping the RNA editing landscape at single-cell resolution.
Comparison to existing tools for bulk RNA-seq data:
Simply git clone this repository using the link at the top right on the main repository page. Should take at most a minute or two.
git clone https://github.com/YeoLab/MARINE.git
MARINE was developed and tested using Python 3.8.18 and is not guaranteed to work with earlier versions of Python. Use the provided .yml file to create a new conda environment that contains all required dependencies, including the proper Python version, for MARINE:
conda env create --file=marine_environment2.yaml
conda activate marine_environment
or if you encounter problems with this approach, try using mamba instead, which can be faster:
conda install -c conda-forge mamba=1.5.11 -y
mamba create -n marine_environment python=3.8.18 -y
mamba env update -n marine_environment --file marine_environment2.yaml
conda activate marine_environment
- MARINE relies on the MD tag being present in your .bam file. Some tools like STAR provide the option to add the MD tag during alignment, but otherwise you may have to add it after alignment. To add the MD tag and then index the processed bam, use the following samtools command templates:
samtools calmd -bAr input.bam reference_genome.fa > input.md.bam
samtools index input.md.bam
- The more cores used, the faster MARINE will run, up to a point
- Ensure that your annotation bedfile has the same chromosome nomenclature (e.g., "9" vs "chr9") as your bam
- The annotation bedfile should be tab-separated and should have a standard bed6 column ordering, as follows:
1 29554 31109 MIR1302-2HG lincRNA +
1 34554 36081 FAM138A lincRNA -
1 65419 71585 OR4F5 protein_coding +
1 89295 133723 AL627309.1 lincRNA -
1 89551 91105 AL627309.3 lincRNA -
usage: marine.py [-h] [--bam_filepath BAM_FILEPATH] [--annotation_bedfile_path ANNOTATION_BEDFILE_PATH]
[--output_folder OUTPUT_FOLDER] [--barcode_whitelist_file BARCODE_WHITELIST_FILE] [--cores CORES]
[--strandedness {0,1,2}] [--barcode_tag BARCODE_TAG] [--min_dist_from_end MIN_DIST_FROM_END]
[--min_base_quality MIN_BASE_QUALITY] [--contigs CONTIGS] [--min_read_quality MIN_READ_QUALITY]
[--sailor [SAILOR]] [--bedgraphs [BEDGRAPHS]] [--verbose] [--keep_intermediate_files]
[--num_per_sublist NUM_PER_SUBLIST] [--paired_end] [--all_cells_coverage]
[--tabulation_bed TABULATION_BED] [--interval_length INTERVAL_LENGTH]
Run MARINE
optional arguments:
-h, --help show this help message and exit
--bam_filepath BAM_FILEPATH
Full path to MD-tagged and indexed .bam file
--annotation_bedfile_path ANNOTATION_BEDFILE_PATH
Full path to bed file with desired annotations in bed6 format (contig start end label1
label2 strand)
--output_folder OUTPUT_FOLDER
Directory in which all results will be generated, will be created if it does not exist
--barcode_whitelist_file BARCODE_WHITELIST_FILE
List of cell barcodes to use for single-cell analysis
--cores CORES Number of CPUs to use for analysis. Will default to using all cores available if not
specified
--strandedness {0,1,2}
Possible values include: 0 (unstranded), 1 (stranded) and 2 (reversely stranded).
--barcode_tag BARCODE_TAG
CB for typical 10X experiment. For long-read and single-cell long read analyses, manually
add an IS tag for isoform or an IB tag for barcode+isoform information. Do not provide any
arguments when processing bulk seqencing
--min_dist_from_end MIN_DIST_FROM_END
Minimum distance from the end of a read an edit has to be in order to be counted
--min_base_quality MIN_BASE_QUALITY
Minimum base quality, default is 0
--contigs CONTIGS Which contigs to process, in comma separated list (ie 1,2,3 or chr1,chr2,chr3, whichever
matches your nomenclature)
--min_read_quality MIN_READ_QUALITY
Minimum read quality, default is 0... every aligner assigns mapq scores differently, so
double-check the range of qualities in your sample before setting this filter
--sailor [SAILOR] Generate SAILOR-style outputs.
--bedgraphs [BEDGRAPHS]
Conversions for which to output a bedgraph for non-single cell runs, (e.g. CT,AI)
--verbose
--keep_intermediate_files
Keep intermediate files for debugging or to use --all_cells_coverage flag
--num_per_sublist NUM_PER_SUBLIST
For single-cell datasets, specifies 'chunking', ie how many contigs to process at once. This
can be lowered to enable lower-memory runs, with the tradeoff of longer runtime
--paired_end Assess coverage taking without double-counting paired end overlapping regions... slower but
more accurate. Edits by default are only counted once for an entire pair, whether they show
up on both ends or not.
--all_cells_coverage Requires --keep_intermediate_files flag to be set. Caution: this can take a long time if too
many sites are used (think thousands of sites x thousands of cells... it gets big quickly),
it is worth reducing the number of sites to tabulate through filtering beforehand, and using
the additional argument --tabulation_bed to specify these sites.
--tabulation_bed TABULATION_BED
Locations to run tabulation across all cells. The fist column should be contig, the second
should match the position in the final_filtered_sites_info.tsv file.
--interval_length INTERVAL_LENGTH
Length of intervals to split analysis into... you probably don't have to change this.
The examples should take no more than a few minutes to run, especially if multiple CPUs are avilable for use (and specified using the --cores arugment). MARINE was developed and tested in Linux running on x86_64 but should work without any special hardware. Please let us know if you encounter any problems by creating a GitHub issue in this repository.
Expected example outputs are contained in the subfolders in the examples folder.
MARINE will calculate edits and coverage on a per-cell basis. For example, the G at position 3000525 occurs in a region in the cell with the barcode ending in CGG-1, which only has 4 reads at that location. Meanwhile, the T at this position occurs instead in the cell with barcode ending in CAC-1 with 12 reads. These cell-specific edit counts and coverages will be reflected in MARINE outputs. Strandedness for 10X inputs should be 2.
Note: MARINE by default will filter out UMI duplicates based on the xf:i:25 tag at explained here: https://www.10xgenomics.com/analysis-guides/tutorial-navigating-10x-barcoded-bam-files
python marine.py \
--bam_filepath examples/data/single_cell_CT.md.subset.bam \
--output_folder examples/sc_subset_CT \
--barcode_whitelist_file examples/data/sc_barcodes.tsv.gz \
--barcode_tag "CB" \
--strandedness 2 \
--all_cells_coverage # This flag will generate barcode x position coverage matrices.
A recommended workflow for single-cell:
- Run MARINE without the --all_cells_coverage flag set
- Filter resulting edit sites down to a set you want to proceed with
- Only run --all_cells_coverage in tandem with --tabulation_bed, which is where you can specify which sites should have coverage calculated across all cells. Otherwise this bit could take a very long time.
MARINE can be used to calculate edits and coverage on a per-cell and per-isoform basis after certain pre-processing steps are taken. Reads can be quantified and assigned to annotated isoforms using IsoQuant (v3.3.0) with parameters: --data-type pacbio, --transcript_quantification unique_only, and --gene_quantification unique_only. The read assignment output from IsoQuant can be used to add an isoform tag for each read, indicating the isoform to which it was assigned. Furthermore, the cell barcode can be concatenated to the isoform in a new tag called "IB", as shown in the IGV screenshot below (grouping labels refer to this tag in this case). Note that a suffix has been added to each IB tag reflecting the ending of both the isoform ID and the cell barcodes, which is used for efficiently calculating coverage only within each appropriate subset of isoform and cell-specific reads.
python marine.py \
--bam_filepath examples/data/LR_single_cell.md.subset.filtered.sorted.bam \
--output_folder examples/sc_lr_subset_CT \
--barcode_whitelist_file examples/data/sc_lr_barcodes.tsv.gz \
--barcode_tag "IB" \
--strandedness 2
This is derived from an APOBEC1-fusion experiment, so we expect an enrichment for C>T (C>U) edits.
python marine.py \
--bam_filepath examples/data/bulk_CT.md.subset.bam \
--output_folder examples/bulk_subset_CT \
--cores 16 \
--annotation_bedfile_path /annotations/hg38_gencode.v35.annotation.genes.bed \
--strandedness 2
--contigs "1"
--bedgraphs "CT"
This is derived from an APOBEC1-fusion experiment, so we should also expect to see an enrichment for C>T (C>U) edits:
Likewise, using bulk_subset_AI.md.subset.bam, which derives from an experiment using an A to I editor, we should expect to see to see an enrichment for A>G (I is interpreted as a G) edits:
python marine.py \
--bam_filepath examples/data/bulk_CT.md.subset.bam \
--output_folder examples/bulk_subset_CT \
--cores 16 \
--annotation_bedfile_path /annotations/hg38_gencode.v35.annotation.genes.bed \
--contigs "chr1" \
--strandedness 2 \
--paired_end