'Smart' parsing of input GFF3 files so that IDs match the annotation FASTA files. #34

jduvick · 2016-03-02T00:40:03Z

Background
In xGDBvm's annotation workflow, user-provided ~annot.gff3 file(s) are combined, parsed and loaded as 'Precomputed Gene Models', each identified by a unique geneId, for display in the genome browser as an annotation track. xGDBvm also expects to load two associated sequence files derived from the GFF3 file data: ~annot.mrna.fa (transcripts) and ~annot.pep.fa (translations). These files allow xGDBvm users to download or query (via Blast) annotation sequences, either in batches or one sequence at a time. For example, clicking on a gene model in the 'Genome Context Mode' of the xGDBvm genome browser brings up a 'Sequence Record' (via getRecord.pl), which displays summary information about that gene model, mostly from the parsed GFF3 table, but also (if available) a CDS translation from the indexed FASTA file (see screenshot) and a link to download the sequence (using returnFASTA.pl).

This functionality in 'getRecord.pl' depends on a unique database value, geneId or transcript_id, stored in the GFF3-parsed table gseg_gene_annotation or cpgat_gene_annotation, being matched by a FASTA identifier in the associated sequence file. The FASTA files are found under /xGDBvm/data/GDBnnn/data/BLAST/, and the requisite queries, paths, and hypertext additions are set by the DSO module SequenceTrack.pm.

The issue
Unfortunately, there is no guarantee that the geneId and/or transcript_id parsed from GFF3 will match the FASTA identifiers provided (although xGDBvm instructions caution users to make sure there is a match). Mismatches are common because a GFF3 table may contain more than one unique identifier per record, displayed variously as e.g. 'ID=', 'geneID=', 'Name=', 'transcript_id=', etc. If more than one identifier is present, the parsing script is programmed to choose among them hierarchically, and it has no way of knowing which one is appropriate for matching a FASTA record.
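To illustrate the failure mode, here is a minimal Python sketch of hierarchical identifier selection. It is not the actual xGDBvm parsing script: the priority order in `PRIORITY`, the function names, and the sample record and FASTA identifier are all assumptions made up for illustration.

```python
def parse_attributes(column9):
    """Split a GFF3 attribute column (column 9) into a dict of key -> value."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


# Hypothetical priority order; the real parser's order is not stated in this issue.
PRIORITY = ("transcript_id", "ID", "Name")


def pick_identifier(attrs, priority=PRIORITY):
    """Return the first (key, value) pair found when walking the priority list."""
    for key in priority:
        if key in attrs:
            return key, attrs[key]
    return None, None


# Made-up record: the FASTA header uses the 'Name' value, but 'ID' wins the hierarchy.
attrs = parse_attributes("ID=mRNA0001;Name=GeneA.1;Parent=gene0001")
chosen = pick_identifier(attrs)
fasta_ids = {"GeneA.1"}
matches = chosen[1] in fasta_ids
print(chosen, matches)
```

In this made-up case the parser stores 'mRNA0001' while the FASTA file is keyed on 'GeneA.1', so a lookup by the stored identifier would find no sequence.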

So, short of requiring users to munge their data ahead of time to ensure an ID match, we need some way to increase the probability that user-uploaded precomputed annotations will include the ID match described above.

Possible solutions
xGDBvm already provides a sequence validation script (from validate_files.php and xGDB_ValidateFiles.sh) and encourages users to run it before initiating their annotation workflow. It includes a rudimentary QC step that compares the number of 'transcript' records in the GFF3 file vs the number of associated FASTA records, and sets a warning flag if the two are not equal.
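The count comparison described above could look roughly like the following Python sketch. This is not the code in xGDB_ValidateFiles.sh; the function names are hypothetical, and treating both 'mRNA' and 'transcript' feature types as transcript records is an assumption.

```python
def count_gff3_transcripts(lines, feature_types=("mRNA", "transcript")):
    """Count GFF3 feature lines whose type (column 3) is a transcript type."""
    count = 0
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip directives, comments and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) == 9 and cols[2] in feature_types:
            count += 1
    return count


def count_fasta_records(lines):
    """Count FASTA records by counting header lines."""
    return sum(1 for line in lines if line.startswith(">"))


# Made-up sample data: two transcripts in the GFF3, one record in the FASTA.
gff3 = [
    "##gff-version 3\n",
    "chr1\tsrc\tgene\t1\t900\t.\t+\t.\tID=gene1\n",
    "chr1\tsrc\tmRNA\t1\t900\t.\t+\t.\tID=mRNA1;Parent=gene1\n",
    "chr1\tsrc\tmRNA\t1\t700\t.\t+\t.\tID=mRNA2;Parent=gene1\n",
]
fasta = [">mRNA1\n", "ATGGCC\n"]

n_gff3 = count_gff3_transcripts(gff3)
n_fasta = count_fasta_records(fasta)
warning = n_gff3 != n_fasta
print(n_gff3, n_fasta, warning)
```

Note that equal counts say nothing about whether the identifiers themselves agree, which is why this check alone does not address the issue.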

Along these lines, one possible solution would be to extend the validation process to analyze the available ~annot.gff3, ~annot.mrna.fa and ~annot.pep.fa files, specifically to parse the transcript / translation ID types present in each (as name:value pairs). Examples of these could then be displayed on the configuration page, allowing the user to select the correct ID type by clicking the appropriate radio button.
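As a rough sketch of what such an analysis might compute, the Python below collects every attribute key found on transcript features, counts how many FASTA identifiers each key's values match, and keeps one example value per key for display. All function names and sample data are hypothetical, and taking the FASTA identifier to be the first whitespace-delimited token of the header is an assumption.

```python
from collections import defaultdict


def fasta_ids(lines):
    """Collect the first token of each non-empty FASTA header line."""
    return {
        line[1:].split()[0]
        for line in lines
        if line.startswith(">") and line[1:].strip()
    }


def id_candidates(gff3_lines, feature_types=("mRNA", "transcript")):
    """Map each attribute key on transcript features to its set of values."""
    values = defaultdict(set)
    for line in gff3_lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] not in feature_types:
            continue
        for field in cols[8].split(";"):
            if "=" in field:
                key, value = field.split("=", 1)
                values[key.strip()].add(value.strip())
    return values


def rank_keys(gff3_lines, fasta_lines):
    """Return (key, matched, total FASTA IDs, example value), best match first."""
    ids = fasta_ids(fasta_lines)
    report = [
        (key, len(vals & ids), len(ids), sorted(vals)[0])
        for key, vals in id_candidates(gff3_lines).items()
    ]
    report.sort(key=lambda row: (-row[1], row[0]))
    return report


# Made-up sample data in which the FASTA headers follow the 'Name' attribute.
gff3 = [
    "##gff-version 3\n",
    "chr1\tsrc\tgene\t1\t900\t.\t+\t.\tID=gene1;Name=GeneA\n",
    "chr1\tsrc\tmRNA\t1\t900\t.\t+\t.\tID=mRNA1;Name=GeneA.1;Parent=gene1\n",
    "chr1\tsrc\tmRNA\t1\t700\t.\t+\t.\tID=mRNA2;Name=GeneA.2;Parent=gene1\n",
]
fasta = [">GeneA.1 hypothetical description\n", "ATGGCC\n", ">GeneA.2\n", "ATGGCA\n"]

report = rank_keys(gff3, fasta)
for key, matched, total, example in report:
    print(f"{key}: {matched}/{total} FASTA IDs matched (e.g. {key}={example})")
```

A report like this could either preselect the best-matching key or simply supply the name:value examples shown next to each radio button, leaving the final choice to the user.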
