Python script which finds matching CDS of different GFF files.
Intended usage is to compare two gff files which should be near identical, e.g. two GFF files generated by different pipelines but from the same sequence
-
Sample GFF File
-
Template GFF File
Sample GFF file is compared against the Template GFF one.
- Gene/Protein List File
Used for outputting counts of GFF features for each gene/protein name in the file. Each name should be a new line, eg:
IRS1 RL10 RL11 RL12 RL13 RL1 RL5A RL6 ...
-
Strain Stats File: Contains summary counts of features and mimatches across sample and template files.
Filename will be TemplateGFFFileName_StrainStats.txt An example output would be:
Strain Num_CDS_Automated Num_CDS_Deposited Num_Genes_Automated Num_Genes_Deposited Num_mRNA_Automated Num_mRNA_Deposited Other/MiscFeatures_Automated Other/MiscFeatures_Deposited Total_Features_Automated Total_Features_Deposited Perfect_Matches No_Matches Total_Mismatches Mismatches_Discarded Mismatches_Logged Total_Mismatch_lenght(Logged_Only) Avg_Mismatch_Lenght AY446894.2.fasta.gff 187 188 169 173 168 20 2 132 526 513 183 0 4 3 1 48 48
-
Mismatch File: Logs detected mismatches. Mismatch treshold is 3 by default, can be changed via the -m arguement.
Filename will be TemplateGFFFileName_Mismatch.txt An example output would be:
Logging mismatches bigger than 3 Mismatch Origin ID=UL119_cds2 168891 169263 Possible Match ID=cds-AAR31662.1 168891 169311 Mismatch Length Start: 0 Stop: 48
-
Protein/Gene Counts File:
Filename will be TemplateGFFFileName_GeneCounts.txt A truncated example output would be:
Gene Num_CDS_Automated Num_CDS_Deposited Num_Genes_Automated Num_Genes_Deposited Num_mRNA_Automated Num_mRNA_Deposited Other/MiscFeatures_Automated Other/MiscFeatures_Deposited Logged_Mismatches(Sample) IRS1 1 1 1 1 1 0 0 0 0 RL10 1 1 1 1 1 0 0 0 0 ...
-
Log File: Contains general logs of gffcompare's process. Can be very long. Useful for debugging.
Filename will be TemplateGFFFileName_Log.txt A truncated example output would be:
Reading ./outputFolder/AY446894.2.fasta_annotation.gff as sample file... GGF Sample file read. Total lines: 526 CDC lines: 187 mRNA lines: 168 Gene lines: 169 Misc. Features lines: 2 Other lines: 0 Reading ./manual/AY446894.2.fasta.gff as template file... GFF Template File read. Total lines: 513 CDC lines: 188 mRNA lines: 20 Gene lines: 173 Misc. Features lines: 0 Other lines: 132 Beginning comparison of CDS features... Searching for matching Sample CDS: ('AY446894.2', 'exonerate', 'CDS', '165858', '166799', '.', '-', '.', 'ID=UL116_cds1;Parent=UL116_mRNA;Name=UL116.1;Product=UL116\n') Perfect match with Template CDS: ('AY446894.2', 'Genbank', 'CDS', '165858', '166799', '.', '-', '0', 'ID=cds-AAR31660.1;Parent=gene-HHV5wtgp102;Dbxref=NCBI_GP:AAR31660.1;Name=AAR31660.1;gbkey=CDS;gene=UL116;locus_tag=HHV5wtgp102;product=protein UL116;protein_id=AAR31660.1\r\n') Start positions: 165858 165858 Stop positions: 166799 166799 ...
gffcompare.py S T [-i -m -p]
-
S: Path of sample GFF file, e.g. ~/directory/SampleGFF.gff
-
T: Path of template GFF file, e.g. ~/directory/TemplateGFF.gff
-
-m
Mismatch Permissibility: The max length of mismatches that will NOT be logged in mismatch file. Default is 3. -
-i
Iteration Number: Max times to run fuzzy filter. Each iteration increases range by mismatch permissibility (default 3). Default is 10 -
-p
Gene Name List File: Directory of file with gene names (1 per line) to be used in generating the counts table. Default is path/proteinList.txt. A Human Cytomegalovirus Protein List is provided as an example.
-
When reading GFF files, any line starting with ## will be skipped
-
CDS, mRNA, gene, misc_features are stored as lists of tuples (each feature is 1 tuple)
For each CDS in the sample file, if there is no perfect match at the starting or ending positions, any possible matches with template CDS will be logged and filtered earlier (see Filter section)
-
Nested loop of sample CDS (outer loop) and template CDS (inner loop)
-
If start & end positions are a perfect match, start looping for next CDS in sample file
-
If only 1 perfect match in either starting or ending position, log as mismatch and move on to next sample CDS
-
If no perfect match at either start or stop position, implement Fuzzy filter
-
Fuzzy Filter allows for the starting and ending positions of a CDS to be within a threshold, which will increase with each loop of the filter. The total iterations of the filter and the mismatch permissibility can be set up by the user (defaults are 10 and 3 respectively)
-
E.g. 1st round with default parameters, the filter would have a mismatch allowance of +/-3 nucleotides, 2nd iteration would be +/- 6, final iteration would be +/-30 (10 iterations total)
-
-
If a starting OR ending position are within range, add them to possible mismatches
-
-
If no possible matches are found, log the inability to find matching CDS
For CDS that have multiple possible matches (result of matching filter). If any of the possible matches are below mismatch permissibility (default 3), no mismatches will be logged for that CDS
Goes through each Gene/Protein of the input file and counts the features (CDS, mRNA, gene, misc_features) of the Sample and Template files
A regex filter* is to count to match protein names between gene/protein list and features
Regex Filter: "(?<![A-Z]|[a-z]|[0-9]| )(?<!Note=)" + gene + "(?![A-Z]|[a-z]|[0-9]| )"
gene : the gene to search for, eg RL1. Determined from protein name file
(?<![A-Z]|[a-z]|[0-9]| ): make sure that the gene is not preceded by letters or characters, eg dont match with 1RL1 or ARL1. Space is to prevent 'comment' fields from being captured (see below)
(?<!Note=): make sure match is not preceded by 'Note=' in gff file, as some ncbi files include the 'parent' gene which can create fake duplicate pattern matching and inflate count of parent gene.
(?![A-Z]|[a-z]|[0-9]): as above, make sure that the gene is not followed by letters or characters, eg dont match with RL11 or RL1A
Overall when searching for RL1, the following should NOT match: 1RL1; ARL1; ;RL11 ;RL1A Note=RL1_ BRL1A etc
The following should match: ;RL1_ =RL1_ =RL1;
Rename headers in Gene Counts file (as well as name of file itself) for consistency