A C++ package that enables the bulk retrieval, processing, and assembly of transcriptome datasets


karolisr/semblans

 
 


Semblans

Author: Miles Woodcock-Girard, Walker Lab, UIC Department of Biological Sciences

                  As-Sembl-y pipeline for tr-ans-criptomes

Semblans is a tool that enables the automatic assembly of de novo transcriptomes for non-model organisms.

The easiest way to install Semblans is to download the latest pre-built binaries here.

By integrating several external packages and taking advantage of C++ data-streaming performance, Semblans streamlines the necessary pre-processing, quality control, assembly, and post-assembly steps, allowing a hands-off assembly process without loss of versatility. The following diagram shows the workflow of the pipeline:

[Figure: Semblans pipeline workflow diagram]

All documentation for Semblans can be found in the wiki.

Dependencies

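Semblans will install most of the dependencies it requires, but make sure you have working installations of the packages named in the command below: Bowtie 2, Jellyfish, Salmon, SAMtools, and NumPy for Python 3.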

On Ubuntu this can be done by running:

sudo apt update
sudo apt install bowtie2 jellyfish salmon samtools python3-numpy
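
As a quick sanity check after installing, the sketch below reports which of these tools are not yet visible on the PATH. The helper name `missing_tools` is hypothetical and not part of Semblans. It assumes the executables carry the same names as the Ubuntu packages above (`bowtie2`, `jellyfish`, `salmon`, `samtools`). NumPy is a Python module rather than an executable, so it is checked by importing it.

```shell
#!/bin/sh
# Hypothetical helper (not part of Semblans): print each named tool
# that cannot be found on the PATH, one per line.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Assumption: executable names match the Ubuntu package names above.
missing=$(missing_tools bowtie2 jellyfish salmon samtools python3)
if [ -n "$missing" ]; then
    # $missing is left unquoted on purpose so the names print on one line.
    echo "Not found on PATH:" $missing
else
    echo "All command-line dependencies found."
fi

# NumPy is a Python module, so test it with an import instead.
if command -v python3 >/dev/null 2>&1; then
    python3 -c 'import numpy' 2>/dev/null || echo "NumPy is not importable from python3."
fi
```

The check only confirms that each tool can be found; it does not verify versions.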

Installation

The easiest way to install Semblans is to download the latest pre-built binaries here.

If instead the user wishes to build from source, they must clone this repository, navigate to the Semblans root directory and then call:

./install.sh

Please allow several minutes for Semblans to set up the necessary packages.

By default, Semblans will not retrieve the PantherDB functional protein database for sequence annotation. If the user intends to utilize Semblans' annotation functionality, they should instead call the following installation command:

./install.sh --with-panther

Be aware that the PantherDB database is large (~17GB compressed; ~80GB uncompressed), and can take some time to download.
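
Given those sizes, it may be worth confirming free disk space before starting the download. The following is a minimal sketch, not part of the Semblans installer. The helper name `has_free_gb` and the 100 GB threshold are assumptions. The threshold simply allows for the compressed and uncompressed copies existing side by side for a while, which may or may not match what `install.sh` actually does.

```shell
#!/bin/sh
# Hypothetical helper (not part of Semblans): succeed if the filesystem
# holding directory $1 has at least $2 GB (counted as GiB) available.
has_free_gb() {
    dir=$1
    need_gb=$2
    # df -P prints POSIX-format output; -k reports 1024-byte blocks.
    # Field 4 of the second line is the available space.
    avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    [ "$avail_kb" -ge $((need_gb * 1024 * 1024)) ]
}

# Assumption: the compressed archive (~17 GB) and the uncompressed
# database (~80 GB) may coexist on disk briefly, so ask for ~100 GB.
if has_free_gb . 100; then
    echo "Enough free space for the PantherDB download."
else
    echo "Less than 100 GB free here; consider freeing space before using --with-panther."
fi
```

Run it from the Semblans root directory so that `.` refers to the filesystem the installer will write to.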

Quick Start / Test data

Included with Semblans is a directory called 'examples'. It contains a very small short-read dataset ("ChloroSubSet") for testing and verifying the Semblans pipeline. To test, uncompress the data from ChloroSubSet.tar.gz. The user should then ensure they have a reference proteome, as one is necessary for several of the pipeline's postprocessing stages. Links to broad, kingdom-level reference proteomes are given at the bottom of this document. This example uses the kingdom-level plant proteome. Once prepared, the user may call:

semblans \
--left ChloroSubSet_1.fq \
--right ChloroSubSet_2.fq \
--prefix ChloroSubSet \
--ref-proteome ensembl_plant.pep.all.fa \
--threads 4 \
--ram 10
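
Because the command above depends on three files being in place (the two read files and the reference proteome), a small pre-flight check can save a failed run. The sketch below is illustrative only. `missing_inputs` is a hypothetical helper, and it assumes the files sit in the current directory under the names used in the example command.

```shell
#!/bin/sh
# Hypothetical helper (not part of Semblans): print every path that is
# missing or empty, and return non-zero if any were found.
missing_inputs() {
    status=0
    for path in "$@"; do
        if [ ! -s "$path" ]; then
            printf 'missing or empty: %s\n' "$path"
            status=1
        fi
    done
    return "$status"
}

# File names are the ones used in the example command above.
if missing_inputs ChloroSubSet_1.fq ChloroSubSet_2.fq ensembl_plant.pep.all.fa; then
    echo "All quick-start inputs are present."
else
    echo "Prepare the files listed above before running semblans."
fi
```

If the reads are reported missing, the archive has probably not been unpacked yet; `tar -xzf ChloroSubSet.tar.gz` extracts it.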

Some users may experience issues, particularly during the Trinity transcript assembly phase. Common errors and solutions are listed on our GitHub wiki page. As cataloguing these is an ongoing process, we urge users to post an issue on the Semblans repository page detailing their problem if it persists or is not addressed there.

Reference peptide sets (gzipped FASTA) for the postprocess step:

Ensembl animal reference (3.1 GiB) [Option 1 | Option 2]

Ensembl fungi reference (4.7 GiB) [Option 1 | Option 2]

Ensembl plant reference (2.1 GiB) [Option 1 | Option 2]
