This repository has been archived by the owner on Jul 20, 2023. It is now read-only.

Complete suite to analyse DNA contaminants of virus/vector preparation from NGS data


#ContaVect

see GitHub Page

Motivation

ContaVect is an object-oriented Python 2.7 script developed to quantify and characterize DNA contaminants in gene therapy vector preparations from NGS data. The automated pipeline can, however, serve the wider purpose of mapping NGS datasets that consist of a mix of DNA sequences onto multiple references. It combines several features, such as reference homology masking, fastq filtering/adapter trimming, short read alignment, SAM file splitting and generation of human-readable output.

##Principle

ContaVect is a Python pipeline composed of several modules linked together to analyze NGS data. The overall workflow is as follows:

  1. Each reference fasta file is parsed to identify all the sequences it contains, and a Reference object is initialized to store the reference characteristics, its name and the outputs required.
  2. Optional: homologies between references can be masked iteratively using blastn from the blast+ package. Masking starts with the last reference, which is masked by all the others, then moves to the penultimate reference, which is masked by all the others except the last, and so forth until only 1 reference remains.
  3. Optional: fastq reads can be filtered by mean quality and adapters can be trimmed using a homemade, fully integrated, parallel fastq filtering module written in Python and C.
  4. If needed, a bwa index is generated from the masked reference files, or from the original ones, after they have been merged together in a temporary directory. Fastq sequences are then aligned against the merged bwa reference index and a temporary SAM file is generated.
  5. Aligned reads from the SAM file are split and attributed either to the Reference object for which a hit was found or to one of the following garbage read categories: unmaped, lowMapq, secondary.
  6. Each reference then generates the outputs requested in the configuration file (bam, sam, bedgraph, bed and covgraph).
  7. Finally, distribution reports and a log file are generated.
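
Steps 2 and 5 above can be sketched in a few lines of Python. This is a minimal illustration and not the ContaVect implementation: the function names, the `min_mapq` threshold of 20 and the order in which the garbage categories are tested are all assumptions. Only the SAM column layout and the flag bits (0x4 for unmapped, 0x100 for secondary) come from the SAM format itself.

```python
# Illustrative sketch only: function names, the MAPQ threshold and the
# precedence of the garbage categories are assumptions, not ContaVect code.


def masking_plan(references):
    """Return (reference_to_mask, references_masking_it) pairs.

    The last reference is masked by all the others, the penultimate by all
    the others except the last, and so on. The first reference is never masked.
    """
    plan = []
    for i in range(len(references) - 1, 0, -1):
        plan.append((references[i], references[:i]))
    return plan


def classify_sam_line(line, min_mapq=20):
    """Return the bin of one SAM alignment line.

    The bin is either a garbage category or the name of the sequence hit
    (column 3). Mapping that sequence name back to its Reference object
    is left out of this sketch.
    """
    fields = line.rstrip("\n").split("\t")
    flag = int(fields[1])   # bitwise FLAG, column 2
    rname = fields[2]       # reference sequence name, column 3
    mapq = int(fields[4])   # mapping quality, column 5
    if flag & 0x4 or rname == "*":
        return "unmaped"    # category name spelled as in the list above
    if flag & 0x100:
        return "secondary"
    if mapq < min_mapq:
        return "lowMapq"
    return rname
```

For three hypothetical references, `masking_plan(["vector", "helper", "host"])` returns `[("host", ["vector", "helper"]), ("helper", ["vector"])]`, so the first reference in the list is kept unmasked.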

ContaVect Design

For more information, comprehensive developer documentation can be generated from ContaVect.dox using Doxygen with doxypy.

Dependencies:

The program was developed under Linux Mint 16/17 and requires a Python 2.7 environment. The following dependencies are required for proper program execution:

In addition, 2 third-party Python packages are also needed

If you have pip already installed, enter the following line to install pysam:

sudo pip install pysam

Get ContaVect

  1. Clone the repository with the --recursive option to also pull the submodule
$ git clone --recursive https://github.com/a-slide/ContaVect/ my_folder/
  2. Enter the root of the program folder and make the main script executable
$ sudo chmod u+x ContaVect.py
  3. Compile the ssw aligner (and add the dynamic library to the path)

If you wish to perform an adapter trimming step before mapping, you need to compile the dynamic library ssw.so, which provides the Smith-Waterman algorithm and is forked from mengyao's Complete-Striped-Smith-Waterman-Library.

To use the dynamic library libssw.so, you may need to add the library directory to the LD_LIBRARY_PATH environment variable (export LD_LIBRARY_PATH=$PWD). For permanent inclusion, edit /etc/ld.so.conf to add the path of the directory containing the library, then update the cache by running /sbin/ldconfig as root.
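
Before launching a run with adapter trimming, it can help to check that the library directory really is on LD_LIBRARY_PATH. The helper below is a hypothetical sketch, not part of ContaVect. It only inspects the directories listed in the variable, so it will not see libraries registered through /etc/ld.so.conf and the ldconfig cache.

```python
import os


def locate_shared_library(name, search_path=None):
    """Return the first directory of search_path that contains `name`.

    search_path is a colon-separated string such as LD_LIBRARY_PATH.
    When omitted, the current value of LD_LIBRARY_PATH is used.
    Returns None if the file is not found in any listed directory.
    """
    if search_path is None:
        search_path = os.environ.get("LD_LIBRARY_PATH", "")
    for directory in search_path.split(os.pathsep):
        # Skip empty entries produced by leading, trailing or doubled colons
        if directory and os.path.isfile(os.path.join(directory, name)):
            return directory
    return None
```

Calling `locate_shared_library("libssw.so")` after the export returns the directory holding the library, or `None` if the variable does not point to it.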

Usage

Prepare the configuration file with your files and settings, following the template Conf.txt file provided with the source files.

$ ./ContaVect.py Conf.txt 

No command line options are available; everything is set in the configuration file.

For headless cluster runs, export the path to the conf/matplotlibrc file as the MATPLOTLIBRC environment variable.
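
When the pipeline is launched from another Python script rather than from a shell, the same variable can be set in the child process environment. The following is a minimal sketch: `headless_env` is a hypothetical helper, and it assumes that conf/matplotlibrc is reachable from the current working directory.

```python
import os


def headless_env(rc_path, base_env=None):
    """Return a copy of the environment with MATPLOTLIBRC set to rc_path.

    The path is made absolute so that it stays valid if the child
    process changes its working directory.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["MATPLOTLIBRC"] = os.path.abspath(rc_path)
    return env
```

The returned dictionary can then be passed to the launcher, for example `subprocess.call(["./ContaVect.py", "Conf.txt"], env=headless_env("conf/matplotlibrc"))`.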

Development notebook

2 possibilities:

  • Use ipython notebook with doc/Logbook.ipynb
  • Consult directly online through nbviewer : Notebook

Authors and Contact
