CREST User guide

How to use the LCAClassifier

Classification algorithm

Taxonomical classification starts with alignment to a reference sequence database (such as SilvaMod or Greengenes) using the NCBI Blast+ or legacy blastall implementation of Megablast. The LCAClassifier requires that output is saved in XML format. The classification is then carried out based on a subset of the best matching alignments using the Lowest Common Ancestor (LCA) of this subset. Briefly, the subset includes sequences that score within x% of the “bit-score” of the best alignment, providing the best score is above a minimum value. Default values for the minimum bit-score is 155 and for the LCA range (x) 2%. Based on cross-validation testing using the non-redundant SilvaMod database, this results in relatively few false positives for most datasets. However, the LCA range can be increased up to about 10% (using the -r option), to increase accuracy with short reads and for datasets with many novel sequences.

In addition to LCA classification, a minimum similarity filter is used, based on a set of rank-specific requirements. Firstly, a sequence must be aligned with at least 99% nucleotide similarity to the best reference sequence in order to be classified to the species rank. For the genus, family, order, class and phylum ranks the respective cut-offs are 97%, 95%, 90%, 85% and 80%. This filter ensures that classification is made to the taxon of the lowest allowed rank, effectively re-assigning sequences to parent taxa until allowed.

Preparing sequences

For amplicon sequences, we strongly recommend noise reduction using e.g AmpliconNoise (for 454 Pyrosequencing data); vsearch or UPARSE for Illumina MiSeq and other platforms; as well as chimera removal (using e.g. Perseus or uchime ) prior to submission.

Preparing an OTU table

For amplicon sequencing experiments with many replicates or similar samples (>~10), we highly recommend that the datasets are clustered into Operational Taxonomic Units (OTUs), prior to classification. OTUs can either be defined as unique sequences (100 % identity) shared between datasets, or representative sequences for a cluster of similar sequences (often 97% identity with maximum- or average linkage is used). OTU clustering is carried out by default in all of the denoising pipelines mentioned above. The LCAClassifier can then use the distribution data of OTUs across datasets, in order to determine weighted relative abundance for each taxon across datasets, as well are OTU richness and a Chao-estimate of total richness.

An OTU table defines the distribution of an OTU across datasets and can be in delimeted text format or in BIOM-format (the later is default in QIIME and the former in many other applications). Make sure to divide datasets into combinations that "make sense", i.e. that you want to analyse together, before OTU clustering. For example, dividing data from a sequencing run of two separate experiements, or merging together data from two sequencing runs belonging to the same experiment. Also make sure sequence / OTU names in the OTU table correspond to those used in the FASTA file with representative sequences, to be aligned.

Using individual sequence datasets

In some cases, OTU clustering is not possible, e.g. when classifying data from shotgun transcriptomic or metagenomic sequencing experiments. For amplicon data, it may be adventagous to avoid OTU clustering in some cases. Even for separate sequence datasets, not clustered together into OTUs, the LCAClassifier will report both the weighted read abundance and number of unique sequences (i.e. richness) for each taxon if the sequence names are annotated with read abundance, i e the number of reads that each sequence represents. This annotation is produced automatically by de-noising the data with AmpliconNoise and follows a simple format; the name for each sequence in the FASTA header ends with _N, where ''N'' represents the number of reads. Alternative accepted formats are weight=N or reads=N.

Aligning sequences using Megablast

First download and unpack the NCBI blastall suite from ftp://ftp.ncbi.nlm.nih.gov/blast/executables/release/LATEST/ and place the binary for Megablast (and the other programs) in your PATH. Then use Megablast to align your environmental SSU rRNA sequences to the Silvamod or Greengenes reference databases that are installed automatically with the LCAClassifier. Other marker genes can also be used, but require that the construction of a custom database (see below).

Silvamod and Greengenes are installed in the directory where LCAClassifier was installed, under parts/flatdb.

Example

To align the sequences of fasta file dataset1.fa to the Silvamod database, providing LCAClassifier was installed in the home directory, use, for legacy Megablast:

megablast -i dataset1.fa -d ~/LCAClassifier/parts/flatdb/silvamod/silvamod128.fasta -b100 -v100 -m7 -o dataset1_silvamod.xml

or, for the BLAST+ suite:

blastn -task megablast -query dataset1.fa -db ~/LCAClassifier/parts/flatdb/silvamod/silvamod128.fasta -num_alignments 100 -outfmt 5 -out dataset1_silvamod.xml

Alignments will be written to the file dataset1_silvamod.xml.

Classification

The executable for the LCAClassifier is called classify and uses a very simple command line interface:

classify [options] alignment-file(s).xml

To classify the alignment file dataset1_silvamod.xml using default parameters, simply use:

classify dataset1_silvamod.xml

Silvamod is the default reference database. In case greengenes was used instead, the -d option must be used to specify the database name:

classify -d greengenes dataset1_greengenes.xml

Let’s say that dataset1.fa contains representative OTU sequneces. Normally, we would be more interested in the taxonomic composition across each individual sample included in that dataset. The information of how OTUs are distributed will then be contained in an OTU-table, for example called dataset1_OTUs.csv. To incorporate this information, use:

classify -t dataset1_OTUs.csv dataset1_silvamod.xml

If the OTU table is instead in BIOM-format, we use the -b option:

classify -b dataset1_OTUs.biom dataset1_silvamod.xml

In both cases, a new OTU-table, in the same format and with the same name as given, will be written to the output directory, with taxonomic annotations added (for text-format as an extra column and for BIOM as observation metadata using the key taxonomy. We can also write annotated fasta-file, if the representative sequences are provided (option -i):

classify -i dataset1.fa -b dataset1_OTUs.biom dataset1_silvamod.xml

To specify output directory name, use option -o:

classify -o dataset1_CREST_Out -i dataset1.fa -b dataset1_OTUs.biom dataset1_silvamod.xml

If several datasets, not clustered into common OTUs, were aligned to the reference database, classification can be carried out for alignment XML file at a time, or simultaneously for several files. The advantage of the later option is that rows will always be inserted in the composition output files for all taxa that are present in at least one of the datasets, even if they are not present in all. For example, to classify dataset1_silvamod.xml.gz and dataset2_silvamod.xml.gz simultaneously, use:

classify dataset1_silvamod.xml.gz dataset2_silvamod.xml.gz

Note that gzip-compressed alignment files can be used - practical for avoiding to use a lot of disk space!

Output

The output of LCAClassifier is written to the directory specified by option -o (by default CREST_Results. It produces at least two output files for each sample dataset classified. If an OTU-table is provided, the output files are named after sample names used in it. If not, they are named using the XML alignment files from which they were derived.

Sample-specific files

For example, for a (compressed) alignment file named Dataset.xml.gz and no specified OTU-table, LCAClassifier will write the following files:

Dataset_Composition.tsv<br/> Written by default, this file contains the taxonomical composition of the dataset in tab-separated text format, designed for reading in a spreadsheet editor like Open Office (or Excel). Composition is given separately for each rank from domain to species, starting with a meta-level giving the total number of reads, classified and unclassfied reads. For each taxon, the total number of reads, its relative abundance, number of unique sequences and a chao estimate of total unique sequences is given. For each rank, the total number of reads assigned to that rank or better is also given.

Dataset_Tree.txt<br/> Written by default, this file shows the taxonomical composition in a simple space-indented tree (in plaintext format), each taxon annotated with the number of reads.

Dataset_Assignments.fasta<br/> Written by using option -a, this file gives all taxonomically annotated sequences in FASTA format. In the FASTA header, the original sequence name is first given, then the predicted taxonomic assignment from the root of the taxonomic tree to the best rank at which classification was possible (separated by semicolons). If the Minimum similarity filter prevented assignment at a higher level for this sequence, indicating it to represent a novel taxa, the last taxon is prefixed with the word "Unknown".

Dataset_Assignments.tsv<br/> Written by using option -p, this file lists all sequence names and their assignments in a simpe semicolon-separated format. If the Minimum similarity filter prevented assignment at a higher level for this sequence, indicating it to represent a novel taxa, the last taxon is prefixed with the word "Unknown".

Common / dataset-specific output files

All_Assignments.tsv This file provides counts of the number of assignments at each taxonomic rank for all datasets in a tab-separated format. Note that only assignments to the taxon node itself are counted, not to child taxa at lower ranks. For each taxon, the full taxonomic path from root to the taxon itself is also given. This file is more suitable for parsing than Dataset_Composition.txt.

All_Cumulative.tsv This file provides cumulative counts for the number of assignments at each taxonomic rank for all datasets in a tab-separated format. As opposed to All_Assignments.tsv, assignments to child taxa (lower ranks) are also counted. This file is more suitable for parsing than Dataset_Composition.txt.

Relaitve_Abundaces.tsv relative abundance data across datasets normalised to the total number of assigned reads in each dataset. Option -m can be used to filter this output from the least abundant taxa. This is recommended particularly when datasets of different sequencing depths (total number of reads), will be compared.

Richness.tsv number of unique OTUs for each taxon

<Sequence-input-file-name> (if given) +"_Assigned.fasta" sequences with assignments added in FASTA header. Written with option -i

Download example output files

Other options

-h: shows a help message listing all options

-a: BI-weights-file calculates biotic index using the submitted file specifying the index weights (e.g. microgAMBI for 16S)

-r: LCA bitscore range in percent (given as an integer >0; default = 2)

-s: Minimum bit-score for classification (integer >0; default = 155)

-n: Normalise relative abundances to classified sequences only (ignoring unclassified)

-f: De-activate the minimum similarity filter

-i fasta-file: By default sequences in the taxonomically annotatated FASTA-output (_Assignments.fasta) are taken directly from the alignment XML file. This means that un-aligned sequence stretches or completely un-aligned sequences are ignored. This option instead uses the sequences from the supplied FASTA-file

-q qual-file: This option works like -i, but instead accepts a file with Phred quality scores in fasta.qual format and outputs an annotated .qual file.

-v: Use Verbose mode. Outputs information about interesting assignments to standard out.

-c: Selects which configuration file to use (by default, the file parts/etc/lcaclassifier.conf in the LCAClassifier’s installation directory is used)

Constructing a custom reference database

Starting from a reference alignment of a phylogenetic marker gene annotated with taxonomical information, the script nds2CREST.py can create a custom reference database for use with the LCAClassifier. We recommend using the program ARB to create and annotate such an alignment from a set of sequences.

nds2CREST.py requires three different input files:

Sequences of the reference marker genes in the alignment, in FASTA format (sequences.fasta in the example below). It is important that the sequences are cropped so that stretches of unaligned sequence is not included. Such sequences may bias the alignments during classification. In ARB, this can be done by using a positional filter when exporting sequences.
A tab-separated text file (in ARB language "NDS file") containing the taxonomic annotations of each aligned sequence (taxa.nds in the example below). This file must contain three columns separated by tabs. The first column is the accession number corresponding to the sequence file. The second contains the taxonomic structure with ranks separated by slash, semicolon or underscores (e.g. Domain;Phylum;Class). The third column contains the species or strain name or a unique description of the sequence. This file can be exported from ARB by first setting up NDS display correctly in the menu Tree > NDS (Node Display Setup). For SILVA, check the leaf box for "acc", then for group for "tax_slv" and for leaf for "full name". Then write the file using File > Export > Export fields using NDS. Make sure to tick the checkbox marked "Use TABs for columns"!
- Example: A16379 Bacteria/Proteobacteria/Gammaproteobacteria_1/Pasteurellales_Pasteurellaceae/Haemophilus Haemophilus ducreyi
A tab-separated text file containing the changes to be made to the taxonomic structure (MCF.txt in the example below). These changes can also be used as a proof-reading; in case they were already carried out, the script simply confirms them and if not they are carried out. To discard this option, use an empty "manual changes file" (MCF). Download an example describing the file format of the MCF

The script nds2CREST is then run using (replace "$LCADir" with the installation direcrory of the LCAClassifier):

$LCADir/bin/python $LCADir/src/LCAClassifier/nds2CREST.py database-name sequences.fasta MCF.txt taxa.nds

This may result in a list of warnings being written to standard.error if for example a sequence included sequences.fasta was not in the taxa.nds file or vice versa, or if duplicate names of taxa appeared in different topological context. In the former case, the sequence will be ignored. In the later, a parent suffix will be added to the sequence, for example "Cryptococcus", which is the name of an insect genus as well as a fungal one, will become "Cryptococcus (Eriococcidae)". The same happens if the same name is used for two different ranks, e.g. "Bacteria;Actinobacteria;Actinobacteria" will become "Bacteria;Actinobacteria (phylum);Actinobacteria (class)".

After successful completion, nds2CREST will create three new files:

# *database-name.tre:* the taxonomy of the reference database in http://en.wikipedia.org/wiki/Newick_format[Newick format], each taxon represented by a unique numeral identifier
# *database-name.map:*  a mapping file, mapping the taxa identifiers to their full names and taxonomic rank
# *database-name.fasta:* the reference database in FASTA-format (simply the same as sequences.fasta but with un-mapped sequences removed).

These files are needed by the LCAClassifier in order to make taxonomic assignments and the .tre and .map files for Silvamod can be found in ~/LCAClassifier/LCADir/parts/flatdb/silvamod/silvamod.map or silvamod.tre if LCAClassifier was installed in the $HOME directory (~). To add your own custom database with the somewhat unimaginative name "database-name" as used in the examples above, create a new directory named "database-name" under the directory flatdb and copy these two files there:

mkdir ~/LCAClassifier/LCADir/parts/flatdb/database-name

cp database-name.* ~/LCAClassifier/LCADir/parts/flatdb/database-name

You also need to format the FASTA-file for Megablast (or BLAST) using the command formatdb (installed with the blastall suite):

cd ~/LCAClassifier/LCADir/parts/flatdb/database-name

formatdb -i database-name.fasta -pF

The final change you need to do before using your own reference database is to tell the configuration file of the LCAClassifier where to find it. Edit the file ~/LCAClassifier/parts/etc/lcaclassifier.conf and add a new line. For example:

database-name = /Home/user/LCAClassifier/parts/flatdb/database-name

Note that this will be overwritten if you update the LCAClassifier or re-build it. To make the change permanent, also add it in the file ~/LCAClassifier/etc/lcaclassifier.conf.in
As you notice, you can keep your reference database anywhere you want in the file system, not necessarily in the LCAClassifier installation directory, as long as you make the correct changes in lcaclassifier.conf

To use your new reference database:

megablast -i env.fa -d ~/LCAClassifier/parts/flatdb/database-name/database-name.fasta -b100 -v100 -m7 -o env_custom.xml

classifiy -d database-name env_custom.xml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly