INVAS is a conjugate graph-based RNA assembler designed for the precise detection of intragenomic inversions and transcriptome assembly.
It supports two analysis modes:
- WGS + RNA mode (recommended): highest confidence inversion detection
- RNA-only mode: assembly and inversion discovery driven purely by RNA-seq data
INVAS leverages an iterative conjugate maximum-flow network model to reconstruct both normal and inverted transcripts.
INVAS reconstructs expressed intragenomic inversions by combining structural variation signals and transcriptome evidence.
Depending on data availability, users can choose between:
| Mode | Required Data | Use Case |
|---|---|---|
| WGS + RNA | RNA-seq + WGS BAM | Accurate inversion breakpoint detection |
| RNA-only | RNA-seq BAM | When WGS data is unavailable |
Both modes ultimately perform conjugate graph-based transcriptome assembly.
- Detection of intragenomic inversions
- Dual running modes: WGS+RNA and RNA-only
- Integration of multiple SV callers (WGS+RNA mode)
- Refinement of inverted splice junctions
- Conjugate graph-based transcript assembly
- Reconstruction of normal and inverted isoforms
- Re-alignment of unmapped RNA-seq reads
- Iterative conjugate maximum-flow network model
- Conda / Mamba
- Reference genome (hg19 or hg38)
- HISAT2 index
- Gene annotation (GENCODE BED)
git clone git@github.com:deepomicslab/INVAS.git
cd INVAS
2. Create conda environments:
```bash
# Data preparation environment
conda env create -f invas_prepare.yaml -n invas_data_prep
# Assembly environment
conda env create -f invas_assembly.yaml -n invas_assembly2.1. Build SSW library:
# Clone SSW library repository
git clone https://github.com/mengyao/Complete-Striped-Smith-Waterman-Library.git
# Navigate to SSW src directory
cd Complete-Striped-Smith-Waterman-Library/src
# Compile SSW library
make
# Copy the compiled library to INVAS bin directory
cp libssw.so ../../scripts/full_pipe/bin/
# Return to INVAS main directory
cd ../../
2.2. Compile INVAS core components:
cd scripts/full_pipe/bin
# Run CMake to configure the build
cmake ../../../
# Compile the source code
make
# Make all executables in bin directory executable
chmod +x *
# Return to the main directory
cd ../../../
INVAS supports two distinct workflows. Please choose one based on your available data.
This mode uses WGS-derived SV breakpoints to guide transcript assembly.
Before executing the INVAS core assembly module (main.py), three preprocessing steps are required to obtain informative reads for accurate inversion detection and transcript assembly.
Process RNA-seq data for quality control, alignment, and unmapped read extraction.
conda activate invas_data_prep
# Create output directory
mkdir -p rna_out/$sample_name
# Run RNA processing
bash scripts/full_pipe/preprocess/process_rna_common.sh -n $sample_name \
-b $input_bam \
-o rna_out/$sample_name \
-r $reference_genome \
-x $hisat_index \
-t $threads \
-p scripts/full_pipe/bin/picard.jarOutput files:
hisat.map_unmapremap.s.bam: Mapped and remapped readsstill_unmap_bwa.s.bam: Unmapped reads aligned with BWA
Process WGS data and detect structural variants using multiple SV callers.
conda activate invas_assembly
# Create output directory
mkdir -p sv_out/$sample_name
# Run SV detection
bash scripts/full_pipe/preprocess/process_sv_common.sh -s $sample_name \
-i $input_wgs_bam \
-o sv_out/$sample_name \
-r $reference_genome \
-t $threads \
-c lumpy,delly,manta,svabaIdentify candidate inverted splicing events based on SV breakpoints.
# Create output directory
mkdir -p candidate_out
# Run candidate detection
bash scripts/full_pipe/preprocess/combine_sv_rna.sh
$sample_name \
sv_out/$sample_name \
rna_out/$sample_name/hisat.map_unmapremap.s.bam \
rna_out/$sample_name/still_unmap_bwa.s.bam \
candidate_out \
$gene_annotation_bed \
delly,manta,lumpy,svabaGene annotation files provided:
- hg19:
scripts/full_pipe/annotation/hg19_genes.gencode.bed - hg38:
scripts/full_pipe/annotation/hg38_genes.gencode.bed
Perform conjugate graph-based assembly to reconstruct transcripts.
conda activate invas_assembly
# Create output directory
mkdir -p assembly_out/$sample_name
# Run assembly
python scripts/full_pipe/main.py \
--input_dir candidate_out/res/$sample_name \
--sample_name $sample_name \
--extend_region 5000 \
--output_dir assembly_out/$sample_name \
--ref_genome $reference_genome \
--rna_map_with_remap_bam rna_out/$sample_name/hisat.map_unmapremap.s.bam \
--rna_unmap_bwa_bam rna_out/$sample_name/still_unmap_bwa.s.bam \
--wgs_bam $wgs_bam \
--with_normal_hapsThe assembly results are organized as follows:
assembly_out/sample_name/
├── GENE1/ # Gene-specific assembly results
│ ├── *_final_inv.bed # Records the locations of inversions
│ ├── *_final_inv.exons # Records the fragments involved in inversions
│ ├── haps/ # Folder containing assembled haplotype sequences
│ │ ├── hap1.fa # Assembled sequence for haplotype 1
│ │ ├── hap2.fa # Assembled sequence for haplotype 2
│ │ └── ... # Additional haplotype sequences (if any)
│ └── summary.txt # Assembly statistics for this gene
├── GENE2/
│ └── ...
├── ...
└── candidate_gene.txt # Summary of all candidate genes with inversions
Each gene folder contains:
- Assembled transcript sequences in FASTA format
- Both normal and inverted isoforms
- Assembly quality metrics
This mode does not require WGS data. Candidate inversions are inferred solely from RNA-seq signals (InvasDB, discordant alignments, unmapped read re-alignment).
Same as WGS+RNA mode:
conda activate invas_data_prep
mkdir -p rna_out/$sample
bash scripts/full_pipe/preprocess/process_rna_common.sh \
-n $sample \
-b $input_rna_bam \
-o rna_out/$sample \
-r $ref \
-x $hisat_index \
-t $threads \
-p scripts/full_pipe/bin/picard.jar
mkdir -p candidate_out
bash scripts/full_pipe/preprocess/combine_rna_only.sh \
$sample \
rna_out/$sample/hisat.map_unmapremap.s.bam \
rna_out/$sample/still_unmap_bwa.s.bam \
candidate_out \
$gene_bed
conda activate invas_assembly
mkdir -p assembly_out/$sample
python scripts/full_pipe/main_rna.py \
--input_file candidate_out/rna_only/${sample}_inv_in_gene_dedup_highcof.txt \
--sample_name $sample \
--extend_region 5000 \
--output_dir assembly_out/$sample \
--ref_genome $ref \
--ref_version hg38 \
--rna_map_with_remap_bam rna_out/$sample/hisat.map_unmapremap.s.bam \
--rna_unmap_bwa_bam rna_out/$sample/still_unmap_bwa.s.bam \
--with_normal_hapsBefore running INVAS on your own data, we recommend testing the installation with the provided example scripts:
# Navigate to the test directory
cd INVAS/test
# Review the example scripts for different scenarios:
ls -la *.sh
# These test scripts demonstrate:
# - Complete pipeline execution
# - Parameter settings for different data types
# - Output structure and interpretation
A complete example workflow for processing sample "3-06PB4872":
#!/bin/bash
# Set up variables
sample_name="3-06PB4872"
threads=16
# Reference files (hg19 example)
ref="/path/to/references/hg19.fa"
hisat_ref="/path/to/references/hisat2_index/hg19"
gene_bed="scripts/full_pipe/annotation/hg19_genes.gencode.bed"
# Input files
rna_fq1="/path/to/data/${sample_name}_R1.fastq.gz"
rna_fq2="/path/to/data/${sample_name}_R2.fastq.gz"
wgs_bam="/path/to/data/${sample_name}_wgs.bam"
# Output directories
base_dir="/path/to/output"
rna_dir="$base_dir/rna_out"
sv_dir="$base_dir/sv_out"
candidate_dir="$base_dir/candidate_out"
assembly_dir="$base_dir/assembly_out"
# Step 1: RNA-Seq Data Processing (Preliminary)
echo "Step 1: Processing RNA-seq data..."
conda activate invas_assembly
mkdir -p $rna_dir/$sample_name
# Initial alignment with HISAT2
hisat2 -p $threads -x $hisat_ref \
-1 $rna_fq1 -2 $rna_fq2 | \
samtools sort -@ $threads -o $rna_dir/$sample_name/hisat2.bam
# Process with Invas RNA pipeline
bash scripts/test_common_run.sh \
-n $sample_name \
-b $rna_dir/$sample_name/hisat2.bam \
-o $rna_dir/$sample_name \
-r $ref \
-x $hisat_ref \
-t $threads \
-p tools/picard.jar
# Step 2: WGS Data Processing and SV Detection (Preliminary)
echo "Step 2: Detecting structural variants..."
conda activate invas_data_prep
mkdir -p $sv_dir/$sample_name
# Extract MHC region (optional, for focused analysis)
samtools view $wgs_bam chr6:28477797-33448354 -O BAM \
-o $sv_dir/$sample_name/${sample_name}_mhc.bam
samtools index -@ $threads $sv_dir/$sample_name/${sample_name}_mhc.bam
# Run SV detection
bash scripts/process_sv_common.sh \
-s $sample_name \
-i $sv_dir/$sample_name/${sample_name}_mhc.bam \
-o $sv_dir/$sample_name \
-r $ref \
-t $threads \
-c lumpy,delly,manta,svaba
# Step 3: Candidate Intragenic Inversion Event Detection (Preliminary)
echo "Step 3: Detecting candidate inversions..."
mkdir -p $candidate_dir
bash scripts/full_pipe/preprocess/combine_sv_rna.sh $sample_name \
$sv_dir/$sample_name \
$rna_dir/$sample_name/hisat.map_unmapremap.s.bam \
$rna_dir/$sample_name/still_unmap_bwa.s.bam \
$candidate_dir \
$gene_bed \
delly,manta,lumpy,svaba
# Step 4: Core Transcriptome Assembly
echo "Step 4: Assembling transcripts..."
conda activate invas_assembly
mkdir -p $assembly_dir/$sample_name
python scripts/full_pipe/main.py \
--input_dir $candidate_dir/res/$sample_name \
--sample_name $sample_name \
--extend_region 5000 \
--output_dir $assembly_dir/$sample_name \
--ref_genome $ref \
--rna_map_with_remap_bam $rna_dir/$sample_name/hisat.map_unmapremap.s.bam \
--rna_unmap_bwa_bam $rna_dir/$sample_name/still_unmap_bwa.s.bam \
--wgs_bam $wgs_bam \
--with_normal_haps
echo "Analysis complete! Results in: $assembly_dir/$sample_name"For processing multiple samples:
#!/bin/bash
# Sample list file (one sample per line)
samples_file="samples.txt"
threads=16
# Reference files
ref="/path/to/references/hg19.fa"
hisat_ref="/path/to/references/hisat2_index/hg19"
gene_bed="scripts/full_pipe/annotation/hg19_genes.gencode.bed"
# Base directories
base_dir="/path/to/output"
# Process each sample
while IFS= read -r sample; do
echo "Processing sample: $sample"
# Run complete pipeline for each sample
bash run_invas_pipeline.sh \
--sample $sample \
--threads $threads \
--ref $ref \
--hisat_ref $hisat_ref \
--gene_bed $gene_bed \
--output_dir $base_dir
done < "$samples_file"This step detects RNA-supported intragenic inversions and generates *inv_in_gene_dedup_highcof.txt.
conda activate invas_preprocess
bash scripts/full_pipe/preprocess/preprocess_onlyrna.sh \
inv_input.bed \
hg38_genes.bed \
still_unmap_bwa.s.bam \
hisat.map_unmapremap.s.bam \
candidate_out/rna_only/sample1
candidate_out/rna_only/sample1/
├── chr1_inv_in_gene_dedup_highcof.txt
├── chr2_inv_in_gene_dedup_highcof.txt
├── ...
Optionally merge all chromosomes:
cat candidate_out/rna_only/sample1/chr*_inv_in_gene_dedup_highcof.txt \
> candidate_out/rna_only/sample1_inv_in_gene_dedup_highcof.txt
conda activate invas_assembly
mkdir -p assembly_out/sample1
python scripts/full_pipe/main_rna.py \
--input_file candidate_out/rna_only/sample1_inv_in_gene_dedup_highcof.txt \
--sample_name sample1 \
--output_dir assembly_out/sample1 \
--ref_genome hg38.fa \
--ref_version hg38 \
--rna_map_with_remap_bam rna_out/sample1/hisat.map_unmapremap.s.bam \
--rna_unmap_bwa_bam rna_out/sample1/still_unmap_bwa.s.bam \
--with_normal_haps
INVAS provides two running modes depending on data availability and quality.
Use WGS+RNA mode when:
- ✅ High-quality WGS data is available
- ✅ Matched RNA-seq data is available
- ✅ You want high-confidence inversion breakpoint detection
- ✅ You want maximum accuracy in haplotype reconstruction
This mode integrates:
- Precise breakpoint evidence from WGS
- Transcript-level support from RNA-seq
➡️ This is the recommended mode whenever good WGS data is available.
Use RNA-only mode when:
- ✅ Only RNA-seq data is available
- ✅ WGS data is missing or of low quality
- ✅ WGS coverage is insufficient for reliable breakpoint calling
In this mode:
-
Inversion candidates are inferred ** from RNA-seq and InvasDB evidence**
-
Final output format is identical to WGS+RNA mode
| Data Condition | Recommended Mode |
|---|---|
| High-quality WGS + RNA | ✅ WGS + RNA |
| RNA only | ✅ RNA-only |
| Low-quality WGS + RNA | ✅ RNA-only |
INVAS supports two running modes with very different resource requirements.
Resource usage in WGS+RNA mode depends strongly on WGS sequencing depth and genome coverage.
- Memory
- ~30–60 GB RAM for typical WGS (30×)
- ≥100 GB RAM only required for very deep WGS or large batch runs
- CPU
- Multi-core processor recommended (16+ cores for reasonable runtime)
- Storage
- ~50 GB per sample for intermediate files (WGS + RNA)
- ~20 GB for reference genome and indices
- Operating System
- Linux (tested on Ubuntu 20.04, CentOS 7)
➡️ Recommended when high-quality WGS data is available, and sufficient memory can be allocated.
RNA-only mode is lightweight and suitable for machines with limited resources.
- Memory
- Tens to hundreds of MB RAM for whole-transcriptome runs
- Even lower when using
--target_gene
- CPU
- 4–8 cores are sufficient
- Storage
- A few GB per sample for RNA-seq intermediates
- Operating System
- Linux (tested on Ubuntu 20.04, CentOS 7)
➡️ Ideal when only RNA-seq data is available or WGS quality is insufficient.
- Using
--target_genesignificantly reduces memory and runtime in both modes - RNA-only mode is recommended for quick validation and exploratory analysis
Common issues and solutions:
-
Memory errors during assembly
- Increase memory allocation in SLURM script
- Reduce
--extend_regionparameter
-
Missing SV calls
- Ensure all SV callers completed successfully
- Check BAM file integrity with
samtools quickcheck
-
No candidate inversions found
- Verify gene annotation file matches reference genome version
- Check SV caller outputs for detected variants
This software is released under the [MIT] License. See LICENSE file for details.
For questions, bug reports, or feature requests:
- Email: [xuedowang2-c@my.cityu.edu.hk]
- GitHub Issues: [https://github.com/deepomicslab/INVAS/issues/new]
Invas incorporates several published tools:
- HISAT2 for RNA-seq alignment
- Manta, Delly, Lumpy, and Svaba for SV detection
- Picard for BAM file processing