CNV-PG is an open-source application written in Python. It consists of two parts: CNV predicting (CNV-P) and CNV genotyping (CNV-G). For CNV-P, we trained a separate classifier for each CNV caller on a subset of validated CNVs from that caller; each classifier is used to identify true CNVs. CNV-G is a genotyper that is compatible with existing CNV callers and generates a uniform set of high-confidence genotypes.
python3
sklearn
matplotlib
pysam
pandas
numpy
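As a quick sanity check before running the scripts, the sketch below reports which of the modules listed above cannot be imported by the current interpreter. The helper name `missing_modules` is hypothetical and not part of CNV-PG; note that `sklearn` is the import name of the `scikit-learn` package.

```python
import importlib.util

# Import names of the dependencies listed above.
REQUIRED = ["sklearn", "matplotlib", "pysam", "pandas", "numpy"]


def missing_modules(names=REQUIRED):
    """Return the module names that this interpreter cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are importable.")
```

Run it with the same interpreter that you pass to the `-p` option, so that the check reflects the environment the scripts will actually use.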
Run "$HOME/CNV-PG/CNV-P/CNV-P_predict.sh -h" to see the usage information. The following options are required:
-i BAMFILE, the path of the BAM file (commonly generated by bwa)
-b BASFILE, the path of the BAS file (provided by the user)
-v VCFFILE, the path of the VCF file (the output of a CNV caller: breakdancer, Delly, Lumpy, Manta or Pindel)
-p PYTHON, the path of the python interpreter
-o OUTDIR, the output directory for the results
-n SAMPLENAME, the prefix of the output files
-c CODE_PATH, the path of the CNV-P code ($HOME/CNV-PG/CNV-P/)
-s CNVCALLER, the name of the CNV caller (breakdancer, Delly, Lumpy, Manta or Pindel)
The "BASFILE" in the above command must be created separately by the user. Its format is shown below.
For example:
bam_filename md5 study sample platform library readgroup #_total_bases #_mapped_bases #_total_reads #_mapped_reads #_mapped_reads_paired_in_sequencing #_mapped_reads_properly_paired %_of_mismatched_bases average_quality_of_mapped_bases mean_insert_size insert_size_sd median_insert_size insert_size_median_absolute_deviation #_duplicate_reads coverage
HG002 - HG002 HG002 ILLUMINA HG002 HG002 - - - - - - - - 569 95 568.177944 163.819637 - 35.41
The columns "sample", "mean_insert_size", "insert_size_sd" and "coverage" are required by the feature extraction step.
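A minimal sketch of how such a BAS file could be produced is shown below. It fills only the four required columns and writes "-" everywhere else, as the example row does for its unused fields. The function name `bas_text` is hypothetical, and the tab delimiter is an assumption, since the example above does not show which whitespace separates the columns.

```python
# Column names copied from the BAS header shown above.
BAS_COLUMNS = [
    "bam_filename", "md5", "study", "sample", "platform", "library",
    "readgroup", "#_total_bases", "#_mapped_bases", "#_total_reads",
    "#_mapped_reads", "#_mapped_reads_paired_in_sequencing",
    "#_mapped_reads_properly_paired", "%_of_mismatched_bases",
    "average_quality_of_mapped_bases", "mean_insert_size", "insert_size_sd",
    "median_insert_size", "insert_size_median_absolute_deviation",
    "#_duplicate_reads", "coverage",
]


def bas_text(sample, mean_insert_size, insert_size_sd, coverage):
    """Return a two-line BAS file (header + one row) as a string."""
    row = dict.fromkeys(BAS_COLUMNS, "-")  # "-" marks an unused column
    row["sample"] = sample
    row["mean_insert_size"] = mean_insert_size
    row["insert_size_sd"] = insert_size_sd
    row["coverage"] = coverage
    header = "\t".join(BAS_COLUMNS)  # assumption: tab-delimited
    values = "\t".join(str(row[col]) for col in BAS_COLUMNS)
    return header + "\n" + values + "\n"


if __name__ == "__main__":
    # Values taken from the HG002 example row above.
    with open("HG002.bas", "w") as handle:
        handle.write(bas_text("HG002", 569, 95, 35.41))
```

The insert size statistics and coverage have to be computed from the BAM file beforehand; this sketch only handles the file layout.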
Run "$HOME/CNV-PG/CNV-P/CNV-P_predict.sh" to classify candidate CNVs:
$HOME/CNV-PG/CNV-P/CNV-P_predict.sh \
-i $HOME/CNV-PG/test_data/BAMFILE \
-b $HOME/CNV-PG/test_data/BASFILE \
-v $HOME/CNV-PG/test_data/VCFFILE \
-p $HOME/python/bin/python3 \
-o $HOME/OUTDIR \
-n SAMPLENAME \
-c $HOME/CNV-PG/CNV-P \
-s CNVCALLER
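When many samples need to be processed, the same command can be assembled programmatically. The sketch below builds the argument list from the options in the list above (with `-v` for the VCF file, as that list states) and rejects caller names that are not among the five supported ones. The function name `build_cnvp_command` is hypothetical, and whether the script treats caller names case-sensitively is not stated here, so the spelling from the option list is used as-is.

```python
import shlex

# Caller names as spelled in the option list above.
CALLERS = ("breakdancer", "Delly", "Lumpy", "Manta", "Pindel")


def build_cnvp_command(script, bam, bas, vcf, python, outdir, sample,
                       code_path, caller):
    """Return the CNV-P_predict.sh invocation as an argument list."""
    if caller not in CALLERS:
        raise ValueError(f"unsupported CNV caller: {caller!r}")
    return [
        script,
        "-i", bam,
        "-b", bas,
        "-v", vcf,
        "-p", python,
        "-o", outdir,
        "-n", sample,
        "-c", code_path,
        "-s", caller,
    ]


if __name__ == "__main__":
    cmd = build_cnvp_command(
        "CNV-P_predict.sh", "sample.bam", "sample.bas", "sample.vcf",
        "python3", "outdir", "SAMPLENAME", "CNV-PG/CNV-P", "Manta",
    )
    # Print a shell-quoted form of the command for inspection.
    print(" ".join(shlex.quote(part) for part in cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` to execute the script.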
CNVCALLER.SAMPLENAME.fil.mer.bed # the candidate CNVs extracted from the VCF file
CNVCALLER.SAMPLENAME.feature.txt # the feature matrix
CNVCALLER.SAMPLENAME.pre.prop.txt # the prediction results, giving the category and probability for each CNV
Similar to CNV-P, run "$HOME/CNV-PG/CNV-P/CNV-G_predict.sh -h" to see the usage information.
$HOME/CNV-PG/CNV-P/CNV-G_predict.sh \
-i $HOME/CNV-PG/test_data/BAMFILE \
-b $HOME/CNV-PG/test_data/BASFILE \
-e $HOME/CNV-PG/test_data/BEDFILE \
-p $HOME/python/bin/python3 \
-o $HOME/OUTDIR \
-n SAMPLENAME \
-c $HOME/CNV-PG/CNV-P
The "BEDFILE" should have 5 columns: chromosome, start, end, size of CNV, and type of CNV (DUP: 1, DEL: 0). It can also be generated by CNV-P (for example, CNVCALLER.SAMPLENAME.fil.mer.bed).
For example:
chr1 10482480 10483779 1300 0
chr1 16151940 16155439 3500 1
chr1 35101421 35111976 10556 0
chr1 39998214 40001244 3031 1
chr1 58743909 58744822 914 0
chr1 60048636 60049661 1026 0
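A hand-made BED file can be checked before genotyping with a sketch like the one below. It verifies the five-column layout and the type code described above. The size check `size == end - start + 1` is an assumption inferred from the example rows (all six satisfy it), not a documented rule, and the function name `parse_bed_line` is hypothetical.

```python
def parse_bed_line(line):
    """Parse one 5-column BED line and return (chrom, start, end, size, type)."""
    fields = line.split()
    if len(fields) != 5:
        raise ValueError(f"expected 5 columns, found {len(fields)}: {line!r}")
    chrom = fields[0]
    start, end, size, code = (int(value) for value in fields[1:])
    if end < start:
        raise ValueError(f"end is smaller than start: {line!r}")
    # Assumption inferred from the example rows: size = end - start + 1.
    if size != end - start + 1:
        raise ValueError(f"size does not match coordinates: {line!r}")
    if code not in (0, 1):
        raise ValueError(f"type must be 0 (DEL) or 1 (DUP): {line!r}")
    return chrom, start, end, size, "DUP" if code == 1 else "DEL"


if __name__ == "__main__":
    # First two rows of the example above.
    for row in ("chr1 10482480 10483779 1300 0",
                "chr1 16151940 16155439 3500 1"):
        print(parse_bed_line(row))
```

If your BED files follow a different size convention, drop or adjust the size check.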
SAMPLENAME.feature.txt # the feature matrix
SAMPLENAME.pre.prop.txt # the genotype and probability for each CNV
Please help us improve CNV-PG by reporting bugs or ideas on how to make things better.