ContFree-NGS.py

A very simple open-source filter that removes sequences from contaminating organisms in your NGS dataset, based on a taxonomic classification file.

Note for users: I appreciate your interest in ContFree-NGS. However, it's worth noting that there are more advanced alternatives currently available. I strongly recommend considering KrakenTools, which offers enhanced individual scripts for the analysis of Kraken/Kraken2/Bracken/KrakenUniq output files.

Requirements

  • python >= 3.8.5
  • ETE Toolkit >= 3.1.2
  • biopython >= 1.78

Installation

Get ContFree-NGS from GitHub

cd ~
git clone https://github.com/labbces/ContFree-NGS.git 

Python dependencies

Install ETE Toolkit (ete3):

pip install ete3

Install Biopython:

pip install biopython

Usage

Opening the help page:

./ContFree-NGS.py -h

usage: ContFree-NGS.py [-h] --taxonomy <taxonomy file> --sequencing_type, --s <p or s> --R1 <R1 file> [--R2 <R2 file>] --taxon <Taxon> [--v]

Removing reads from contaminating organisms in Next Generation Sequencing datatasets

optional arguments:
  -h, --help            show this help message and exit
  --taxonomy <taxonomy file>
                        A taxonomy classification file
  --sequencing_type, --s <p or s>
                        paired-end (p) or single-end (s)
  --R1 <R1 file>        FASTQ file 1
  --R2 <R2 file>        FASTQ file 2
  --taxon <Taxon>       Only this taxon and its descendants will be maintained
  --v, --version        show program's version number and exit

There are four required parameters:

--taxonomy: Taxonomic classification file (the output of Kraken2 or another classification tool).

--sequencing_type: Use 'p' for paired-end reads or 's' for single-end reads.

--R1 and --R2: For paired-end reads, use --R1 for read file 1 and --R2 for read file 2. For single-end reads, pass the read file with --R1 and omit --R2.

--taxon: The user must provide a target taxon (e.g. Viridiplantae). Only sequences labeled with this target taxon or one of its descendants are kept in the filtered file. Sequences that do not belong to the target taxon are discarded, and sequences that were not labeled with any taxon are written to the unclassified file.

ContFree-NGS will process the NGS dataset and its taxonomic classification file in the following way:

a) The user generates a taxonomic classification file and runs ContFree-NGS, providing a target taxon;

b) ContFree-NGS creates an indexed database for the NGS dataset to reduce processing time;

c) ContFree-NGS then checks whether the taxon assigned to each sequence is the target taxon or one of its descendants, and writes the filtered and unclassified files.
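The decision in step (c) can be sketched with a toy taxonomy. This is a minimal illustration only, not the code ContFree-NGS runs: the names `PARENT`, `belongs_to` and `route` are hypothetical, the lineage is heavily simplified (intermediate ranks are skipped), and a taxon ID of 0 is assumed to mark an unlabelled read, as in Kraken2 output.

```python
# Simplified child -> parent map of NCBI taxon IDs (assumption: intermediate
# ranks are omitted; a real run would use the full NCBI taxonomy).
# 1 = root, 2759 = Eukaryota, 33090 = Viridiplantae,
# 3702 = Arabidopsis thaliana, 2 = Bacteria, 562 = Escherichia coli
PARENT = {2759: 1, 33090: 2759, 3702: 33090, 2: 1, 562: 2}


def belongs_to(taxid, target, parent=PARENT):
    """Return True if taxid is the target taxon or one of its descendants."""
    while True:
        if taxid == target:
            return True
        if taxid not in parent:  # reached the root without meeting the target
            return False
        taxid = parent[taxid]


def route(taxid, target):
    """Decide where a read goes, given the taxon ID assigned to it."""
    if taxid == 0:  # assumption: 0 means the classifier gave no label
        return "unclassified"
    return "filtered" if belongs_to(taxid, target) else "discarded"


for taxid in (3702, 562, 0):
    print(taxid, route(taxid, target=33090))
```

With Viridiplantae (33090) as the target, the plant read is kept, the bacterial read is discarded, and the unlabelled read goes to the unclassified output.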

Note that the accuracy of ContFree-NGS contamination removal is directly dependent on the accuracy of the taxonomic classification engine, as ContFree-NGS uses the taxonomic label of each sequence to remove those that are from contaminants.

Example

To assess the contamination of an NGS dataset, ContFree-NGS relies on a taxonomic classification file containing a taxon ID (NCBI Taxonomy ID) for every sequence in the dataset. This taxonomic classification file can be generated with a taxonomic classification tool, such as Kraken2 or Kaiju.
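As a rough illustration of what such a file provides, the sketch below reads Kraken2-style per-read output into a mapping from read ID to taxon ID. It assumes the standard tab-separated Kraken2 layout (status, read ID, taxon ID, length, LCA mapping) with plain numeric taxon IDs, so output produced with `--use-names` would need extra parsing. The function name `read_kraken_labels` and the example reads are hypothetical.

```python
import io


def read_kraken_labels(handle):
    """Map each read ID to the taxon ID assigned by the classifier."""
    labels = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:  # skip blank or malformed lines
            continue
        read_id, taxid = fields[1], fields[2]
        labels[read_id] = int(taxid)  # assumption: numeric taxon IDs only
    return labels


# Two made-up records: one classified read and one unclassified read.
example = io.StringIO(
    "C\tread_1\t3702\t150\t3702:116\n"
    "U\tread_2\t0\t150\t0:116\n"
)
labels = read_kraken_labels(example)
print(labels)  # {'read_1': 3702, 'read_2': 0}
```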

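We have prepared an artificially contaminated dataset for your first run, available at ContFree-NGS/data/. This dataset contains three files:

  • artificially_contaminated.kraken
  • artificially_contaminated_1.fastq
  • artificially_contaminated_2.fastq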

Check ContFree-NGS/data/README.md for more information about the artificially contaminated dataset.

Running ContFree-NGS on the contaminated dataset, keeping only Viridiplantae and its descendant taxa:

./ContFree-NGS.py --taxonomy data/artificially_contaminated.kraken --s p --R1 data/artificially_contaminated_1.fastq --R2 data/artificially_contaminated_2.fastq --taxon Viridiplantae 

This should print the following on your screen:

Indexing fastq files, please wait ... 

-------------------------------------------------
Contamination removal was successfully completed!
-------------------------------------------------
Viridiplantae descendants sequences: 410
Contaminant sequences: 128
Unlabelled sequences: 462
-------------------------------------------------
Viridiplantae descendants sequences are in the filtered files
Contaminant sequences were discarded
Unlabelled sequences are in the unclassified files

It should also generate the following files:

  • artificially_contaminated_1.filtered.fastq
  • artificially_contaminated_1.unclassified.fastq
  • artificially_contaminated_2.filtered.fastq
  • artificially_contaminated_2.unclassified.fastq

Runtime and RAM usage

ContFree-NGS runtime and RAM usage are described in the chart below:

[Figure: Runtime and RAM usage]

This figure shows the RAM usage and the runtime required to remove contaminants from the three artificially contaminated datasets.

Reading and writing large files takes considerable time. If you are working with big files, we recommend splitting the taxonomic classification file into smaller files. This can be done as follows:

split -l lines -d --additional-suffix=.taxonomic_file artificially_contaminated.kraken splitted_

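Replace lines with the maximum number of lines you want in each file. This splits your large taxonomic classification file into smaller files named with the given prefix and a numeric suffix (splitted_00.taxonomic_file, splitted_01.taxonomic_file, and so on).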
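If GNU split is not available, the same chunking can be approximated in Python. The sketch below is an assumption-laden equivalent, not part of ContFree-NGS: `split_by_lines` and its parameters are hypothetical names, and it mirrors the prefix, two-digit numeric suffix, and additional suffix used in the command above.

```python
import itertools
from pathlib import Path


def split_by_lines(path, lines_per_file, out_dir=".",
                   prefix="splitted_", suffix=".taxonomic_file"):
    """Write consecutive chunks of at most lines_per_file lines each."""
    out_dir = Path(out_dir)
    out_paths = []
    with open(path) as src:
        for index in itertools.count():
            # Take the next block of lines; an empty block means end of file.
            chunk = list(itertools.islice(src, lines_per_file))
            if not chunk:
                break
            out = out_dir / f"{prefix}{index:02d}{suffix}"
            out.write_text("".join(chunk))
            out_paths.append(out)
    return out_paths
```

For example, `split_by_lines("artificially_contaminated.kraken", 100000)` would write the chunks to the current directory. Because the classification file holds one record per line, splitting on line boundaries keeps every record intact.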

Publication

Peres, F.V., Riaño-Pachón, D.M. (2021). ContFree-NGS: Removing Reads from Contaminating Organisms in Next Generation Sequencing Data. Advances in Bioinformatics and Computational Biology. BSB 2021. Lecture Notes in Computer Science, vol 13063. Springer, Cham. DOI: https://doi.org/10.1007/978-3-030-91814-9_6