Skip to content

A high-throughput error correction pipeline for NGS data using Unique Molecular Identifiers (UMIs).

License

Notifications You must be signed in to change notification settings

epiliper/rumina

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

RUMINA - Rust-based Unique Molecular Identifier Network Analysis

RUMINA is a performant pipeline for error correction of Next-Generation Sequencing (NGS) data using Unique Molecular Identifiers (UMIs), fit for shotgun-based and amplicon-based methodologies.

RUMINA deduplicates reads associated with a single template molecule, using the coordinates of read alignment and UMI barcode sequence to perform correction of PCR and sequencing errors. The above strategy also allows for correction of errors in UMI sequences (directional clustering).

This pipeline is tested for processing ~600 million reads in ~5 hours, at a rate of 120 million reads processed per hour (tested on 10-core M1 Max Mac Studio).

Workflow:

Installation:

Release binaries

Navigate to releases and download the zip for your system's CPU architecture. Unzip and cd into the directory, and run ./rumina -h to ensure it's working.
It's recommended to move the binary to someplace in your $PATH for convenience.

Compiling from source

dependencies:

export RUSTFLAGS="-C target-cpu=native" 

git clone https://github.com/epiliper/rumina.git
cd rumina
cargo build --release 
mv target/release/rumina .

The binary will be located at ./rumina/rumina.
NOTE: Using this option may yield performance gains, as the target-cpu=native flag is not used when making the release binaries.

Usage:

rumina <input (file or directory)> -g <grouping_method> -s <separator> <optional args>

an example command:
rumina example.bam -g directional -s : -x 100 -t 8


The input to rumina can be a file or a directory; if a directory, all BAM files within (excluding pipeline products) are processed.

Arguments

🔹 = mandatory, no default

input 🔹

The input file or directory. If a file, it must be:

  • in BAM format
    • The UMI should be in the read QNAME field (see image under --separator). Illumina data base-called by BCL Convert should be formatted this way by default.
    • sorted and indexed with samtools sort and samtools index beforehand.

If the input is a directory, all BAM files within (excluding pipeline products) will be processed per the other arguments specified.

-g, --grouping_method 🔹

Specifies how/if to merge UMIs based on edit distance, to account for PCR mutations and NGS errors in UMI sequence. Options are:

  • directional: Merge UMIs via directional clustering. See Amplicon section for more details. This is the best option for amplicon sequencing data.
  • acyclic: Same as directional clustering, except UMI networks are limited to a depth of 1, i.e. UMIs that are predicted children cannot have UMIs as child nodes.
  • raw: Treat each UMI as genuine; UMIs are not merged. This is the best option for metagenomics/shotgun sequencing data.
-s, --separator 🔹

Specifies the character in the read QNAME delimiting the UMI barcode from the rest of the string. This is usually _ or :.

-x, --split_window (default = auto)

dictates how to split input BAM files into subfiles (for avoiding memory overflow).

This is usually necessary for BAM files with high sequencing depth that would otherwise cause the program to overuse available memory.For this reason, this value is calculated by default unless otherwise specified.

Splitting happens along coordinates of the reference genome in the specified BAM file; If --split_window 100 was used, reads for every 100bp stretch of the reference would be processed in separate batches, prior to being written to output. This applies to every reference genome present in the input alignment.

Options are:

  • auto: calculate the recommended split size (in basepairs along genome) based on input file size. If input is a directory, this will be applied independently to each file within the directory
  • positive integer: split input files by a fixed window size. If input is a directory, this will be applied to each file within the directory. This has been tested with values ranging from 50 - 500.
  • none (default): process the input as one file without splitting by coordinate window. Using this option with larger BAMs may result in memory overuse.
-t, --threads (default = all)

Specifies the number of threads RUMINA is allowed to use. Threads are used to parallelize processing of individual reference coordinates, and for I/O operations.

By default, RUMINA will attempt to use all available threads.

-l, --length (optional)

if used, groups reads by length as well as coordinate. This is recommended for metagenomics data with high read depth, as this will group reads more stringently and likely produce more singleton groups.

--only-group (optional)

if used, reads will be grouped (assigned a group-specific "UG" tag), but not deduplicated or error-corrected. This is useful if you want to manually check how grouping works with a given file.

--outdir (default = rumina_output)

The output directory, relative to the parent directory of the input files/directory, in which RUMINA's output will be stored.

Arguments for paired-end input

using either of these arguments will automatically treat the input as paired-end.

--halve_pairs (optional)

Use only R1 for deduplication, discarding R2. This is similar to UMI-tools, in that R2 reads are not part of UMI clusters.

--merge_pairs (REF_FASTA, 🔹)

Use both R1 and R2 for deduplication, and merge overlapping forward/reverse reads with the same barcode after initial deduplication. Merged reads are then realigned to the reference genome, which should be supplied in FASTA format. This is untested with segmented genomes or eukaryotic genomes, and is under active development.

Forward/reverse pairs are merged only if they contain a minimum number of overlapping bases, which is controlled by the --min_overlap_bp argument. Forward/reverse pairs identified to have discordant sequences are discarded, and reads unable to be merged for other reasons are still written to output.

--min_overlap_bp (default = 3)

The minimum number of bases shared by two reads at the same reference coordinates for merging to occur in -merge_pairs. Reads not discordant in sequence but not meeting this threshold will not be merged, and instead both written to the output file.

About

A high-throughput error correction pipeline for NGS data using Unique Molecular Identifiers (UMIs).

Resources

License

Stars

Watchers

Forks

Packages

No packages published