Aletsch implements an efficient algorithm to assemble multiple RNA-seq samples (or multiple cells for single-cell RNA-seq data). The datasets and scripts used to compare the performance of Aletsch with other assemblers are available at aletsch-test.
Aletsch can be installed through conda or by compiling source (see INSTALLATION).
The usage of aletsch is:
./aletsch -i <input-bam-list> -o <output.gtf> [options]
We highly recommend generating profiles for individual samples first, then running the assembly with those profiles:
./aletsch --profile -i <input-bam-list> -p <profile>
./aletsch -i <input-bam-list> -o <output.gtf> -p <profile> -d <gtf> [options]
Note: The directories <profile> and <gtf> must exist before execution.
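The two-step workflow above can be sketched in Python as follows. This is a minimal illustration, not part of Aletsch: the function name `prepare_run` and its arguments are hypothetical, and only the flags shown in the commands above are used. It creates the profile and GTF directories first, since Aletsch does not create them.

```python
import os

def prepare_run(bam_list, output_gtf, profile_dir, gtf_dir, aletsch="./aletsch"):
    """Create the -p and -d directories and return the two commands as argv lists.

    Hypothetical helper: the name and signature are assumptions for illustration.
    """
    # Aletsch expects both directories to exist before it starts.
    os.makedirs(profile_dir, exist_ok=True)
    os.makedirs(gtf_dir, exist_ok=True)

    # Step 1: profile individual samples and save the profiles.
    profile_cmd = [aletsch, "--profile", "-i", bam_list, "-p", profile_dir]
    # Step 2: assemble, reading the saved profiles and writing per-sample output.
    assemble_cmd = [aletsch, "-i", bam_list, "-o", output_gtf,
                    "-p", profile_dir, "-d", gtf_dir]
    return profile_cmd, assemble_cmd
```

Each returned list can be passed to `subprocess.run(cmd, check=True)`, running the profiling command before the assembly command.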
Each line of input-bam-list describes a single sample, with 3 fields separated by spaces.
The 3 fields are: alignment-file (in .bam format), index-alignment-file (in .bai format), and protocol.
The index-alignment-file can be generated using samtools (e.g., samtools index ...).
The protocol is one of 5 options:
- single_end (for Illumina single-end RNA-seq protocol)
- paired_end (for Illumina paired-end RNA-seq protocol)
- pacbio_ccs (for PacBio Iso-Seq CCS reads)
- pacbio_sub (for PacBio Iso-Seq sub-reads)
- ont (for Oxford Nanopore RNA-seq)
Aletsch will use different parameters / algorithms to process different data types.
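As an illustration of the list format described above, the following sketch builds the text of an input-bam-list from (alignment-file, index-alignment-file, protocol) triples. The helper `format_bam_list` and the file names in the usage note are hypothetical; the five protocol names are the ones listed above.

```python
# The five protocol names accepted in the third field.
PROTOCOLS = {"single_end", "paired_end", "pacbio_ccs", "pacbio_sub", "ont"}

def format_bam_list(samples):
    """Return input-bam-list text: one 'bam bai protocol' line per sample.

    Hypothetical helper, written for illustration only.
    """
    lines = []
    for bam, bai, protocol in samples:
        if protocol not in PROTOCOLS:
            raise ValueError(f"unknown protocol {protocol!r}")
        # Fields are space-separated, so a space inside a path would split it.
        if " " in bam or " " in bai:
            raise ValueError("paths must not contain spaces")
        lines.append(f"{bam} {bai} {protocol}")
    return "".join(line + "\n" for line in lines)
```

For example, `format_bam_list([("s1.bam", "s1.bam.bai", "paired_end")])` produces a one-line list that can be written to a file and passed to `-i`.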
Aletsch requires that each input alignment file be sorted; if a file is not sorted, use samtools to sort it (samtools sort input.bam > input.sort.bam).
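One way to check whether a file needs sorting is to inspect its header. The sketch below assumes that "sorted" here means coordinate-sorted, which the text above does not state explicitly, and that the header text was obtained separately (for example with `samtools view -H input.bam`). The function name `is_coordinate_sorted` is hypothetical.

```python
def is_coordinate_sorted(header_text):
    """Return True if the SAM header's @HD line declares SO:coordinate.

    Assumption: Aletsch's "sorted" requirement refers to coordinate order.
    """
    for line in header_text.splitlines():
        fields = line.split("\t")
        if fields[0] == "@HD":
            # The SO tag on the @HD line records the declared sort order.
            return "SO:coordinate" in fields[1:]
    # No @HD line: the sort order is not declared, so treat it as unsorted.
    return False
```

A header without an `@HD` line, or with `SO:unsorted` or `SO:queryname`, is reported as not sorted, in which case `samtools sort` should be run first.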
The assembled transcripts from all these samples will be written to output.gtf, in standard .gtf format.
Aletsch provides several options for transcript assembly, supporting both its unique parameters and those required by the core algorithm of Scallop. For a detailed list, execute aletsch without arguments.
Parameters | Type | Default Value | Description |
---|---|---|---|
--help | | | Displays Aletsch usage information and exits. |
--version | | | Shows Aletsch version information and exits. |
--profile | | | Profiles individual samples and exits. Writes to files if -p is specified. |
-l | string | | Specifies chromosomes to assemble. |
-L | string | | Specifies a file containing a list of chromosomes to assemble. |
-d | string | | Output directory for individual sample transcripts. Directory must exist prior to execution. |
-p | string | | Directory for reading/saving individual sample profiles. Directory must exist prior to execution. |
-t | integer | 10 | Number of threads. |
-c | integer | 200 | Maximum number of splice graphs in a cluster, recommended as twice the number of samples. |
-s | float | 0.2 | Minimum similarity for combining two splice graphs. |
- If the -l string or -L file option is provided, Aletsch assembles only the specified chromosomes; otherwise, it assembles all chromosomes.
- Directories specified by -d and -p must exist before running Aletsch; the tool does not create directories.
- With --profile, Aletsch infers profiles of individual samples, using the XS tag from input BAM files.
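The recommended value for -c (twice the number of samples) can be derived from the input-bam-list, since each non-empty line describes one sample. The following is a small sketch under that assumption; the function name `recommended_cluster_size` is hypothetical.

```python
def recommended_cluster_size(bam_list_text):
    """Return twice the number of samples listed in an input-bam-list.

    Hypothetical helper: counts non-empty lines, one per sample.
    """
    n_samples = sum(1 for line in bam_list_text.splitlines() if line.strip())
    return 2 * n_samples
```

For a list with 100 samples this gives 200, which matches the default value of -c.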
Aletsch employs a random forest model for scoring transcripts, available for download from Zenodo. Use the provided Python script score.py with this model.
Required Python libraries: NumPy, pandas, scikit-learn, joblib
- Using pip:
  pip install numpy pandas scikit-learn joblib
- Using conda (recommended):
  conda install numpy pandas scikit-learn joblib
Score transcripts with the syntax below:
python3 score.py -i <individual_gtf_dir> -m <pretrained_model.joblib> -c <num_of_samples> -p <min_probability_score> -o <output_score.csv>
Parameter | Type | Default | Description |
---|---|---|---|
-i | String | | Directory containing Aletsch's feature files (x.trstFeature.csv). This is the same directory where Aletsch outputs individual GTF files, as designated by the -d option in Aletsch's assembly process. |
-m | String | | Path to the pre-trained model file for scoring. |
-c | Integer | | Number of samples/cells. |
-p | String | 0.2 | Minimum probability score threshold (range: 0 to 1). |
-o | String | | Output directory of scored .csv file. |
Assume <individual_gtf_dir> contains the files 0.trstFeature.csv through n.trstFeature.csv. Files 0.trstFeature.csv to (n-1).trstFeature.csv are the feature files for the individual samples, in order from the first to the last sample. The file n.trstFeature.csv is derived from the combined graph.
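The naming convention above can be sketched as a small helper that separates the per-sample feature files from the combined-graph file. The function name `split_feature_files` is hypothetical; the sketch sorts by the numeric prefix (so that 10 comes after 9) and treats the highest-numbered file as the combined one, as described above.

```python
import re

# Matches names such as "0.trstFeature.csv" and captures the numeric index.
FEATURE_RE = re.compile(r"^(\d+)\.trstFeature\.csv$")

def split_feature_files(filenames):
    """Return (per_sample_files, combined_file), ordered by numeric index.

    Hypothetical helper, written for illustration only.
    """
    indexed = []
    for name in filenames:
        match = FEATURE_RE.match(name)
        if match:
            indexed.append((int(match.group(1)), name))
    if not indexed:
        raise ValueError("no *.trstFeature.csv files found")
    ordered = [name for _, name in sorted(indexed)]
    # The last (highest-numbered) file comes from the combined graph.
    return ordered[:-1], ordered[-1]
```

The file names could come from `os.listdir(individual_gtf_dir)`; names that do not match the pattern, such as the individual GTF files, are ignored.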