'Smart' parsing of input GFF3 files so that IDs match the annotation FASTA files. #34

Open
jduvick opened this issue Mar 2, 2016 · 1 comment

jduvick commented Mar 2, 2016

Background
In xGDBvm's annotation workflow, user-provided ~annot.gff3 file(s) are combined, parsed, and loaded as 'Precomputed Gene Models', each identified by a unique geneId, for display in the genome browser as an annotation track. xGDBvm also expects to load two associated sequence files derived from the GFF3 file data: ~annot.mrna.fa (transcripts) and ~annot.pep.fa (translations). These files allow xGDBvm users to download or query (via Blast) annotation sequences, either in batches or one sequence at a time. For example, clicking on a gene model in the 'Genome Context Mode' of the xGDBvm genome browser brings up a 'Sequence Record' (via getRecord.pl). The record displays summary information about that gene model, mostly from the parsed GFF3 table, but also, if available, a CDS translation from the indexed FASTA file (see screenshot) and a link to download the sequence (using returnFASTA.pl).

[Screenshot of a Sequence Record page, taken 3/1/16 at 5:58 pm]

This functionality in getRecord.pl depends on a unique database value, geneId or transcript_id, found in the GFF3-parsed table gseg_gene_annotation or cpgat_gene_annotation, which must be matched by a FASTA identifier in the associated sequence file. The FASTA files are found under /xGDBvm/data/GDBnnn/data/BLAST/, and the requisite queries, paths, and hypertext additions are set by the DSO module SequenceTrack.pm.

The issue
Unfortunately, there is no guarantee that the geneId and/or transcript_id parsed from GFF3 will match the FASTA identifiers provided (although xGDBvm instructions caution users to make sure there is a match). A mismatch is common because the GFF3 table may contain more than one unique identifier, displayed variously as e.g. 'ID=', 'geneID=', 'Name=', 'transcript_id=', etc. If more than one identifier is present, the parsing script is programmed to choose from among these identifiers hierarchically, and it can't know which one is appropriate for matching a FASTA record.
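
To make the problem concrete, here is a minimal Python sketch of hierarchical identifier selection from a GFF3 attributes column. It is not the xGDBvm parser: the precedence order in `ID_PRECEDENCE`, the function names, and the sample identifiers are all made up for illustration, since the actual order is not stated here.

```python
from urllib.parse import unquote

# Hypothetical precedence order (an assumption; the real parser's order
# is not given in this issue).
ID_PRECEDENCE = ("transcript_id", "ID", "Name", "geneID")


def parse_attributes(column9):
    """Split a GFF3 attributes column (column 9) into a tag -> value dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" not in field:
            continue  # skip empty or malformed fields
        tag, value = field.split("=", 1)
        # GFF3 percent-encodes reserved characters in attribute values
        attrs[tag.strip()] = unquote(value.strip())
    return attrs


def pick_identifier(attrs, precedence=ID_PRECEDENCE):
    """Return (tag, value) for the first tag in `precedence` that is present."""
    for tag in precedence:
        if tag in attrs:
            return tag, attrs[tag]
    return None


# Made-up attribute string carrying three different identifiers
attrs = parse_attributes("ID=mRNA00001;Name=geneA.1;transcript_id=tx0001")
chosen = pick_identifier(attrs)
print(chosen)
```

With this order the parser settles on transcript_id, but if the FASTA headers were written from the Name value, no record would be found. Nothing in the GFF3 file alone says which choice is right.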

So, short of requiring users to munge their data ahead of time to ensure an ID match, we need some way to increase the probability that user-uploaded precomputed annotations will include the above-described ID match.

Possible solutions
xGDBvm already provides a sequence validation script (from validate_files.php and xGDB_ValidateFiles.sh) and encourages users to run it before initiating their annotation workflow. It includes a rudimentary QC step that compares the number of 'transcript' records in the GFF3 file vs the number of associated FASTA records, and sets a warning flag if the two are not equal.
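
For reference, a count comparison of this kind could look roughly like the Python sketch below. It is not the code in xGDB_ValidateFiles.sh. It assumes a 'transcript' record is a GFF3 feature line whose type column is mRNA or transcript, and that each FASTA record starts with a '>' header line; the function names and sample data are hypothetical.

```python
def count_transcripts(gff3_lines, types=("mRNA", "transcript")):
    """Count GFF3 feature lines whose type (column 3) is a transcript type."""
    n = 0
    for line in gff3_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip directives, comments, and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) >= 3 and cols[2] in types:
            n += 1
    return n


def count_fasta_records(fasta_lines):
    """Count FASTA records by counting header lines."""
    return sum(1 for line in fasta_lines if line.startswith(">"))


def counts_match(gff3_lines, fasta_lines):
    """Return (ok, n_gff3, n_fasta); ok is False when a warning is due."""
    n_gff3 = count_transcripts(gff3_lines)
    n_fasta = count_fasta_records(fasta_lines)
    return n_gff3 == n_fasta, n_gff3, n_fasta


# Made-up sample data: two transcripts, one FASTA record
gff3 = [
    "##gff-version 3\n",
    "chr1\tsrc\tgene\t1\t900\t.\t+\t.\tID=geneA\n",
    "chr1\tsrc\tmRNA\t1\t900\t.\t+\t.\tID=geneA.1;Parent=geneA\n",
    "chr1\tsrc\tmRNA\t1\t700\t.\t+\t.\tID=geneA.2;Parent=geneA\n",
]
fasta = [">geneA.1\n", "ATGGCC\n"]
result = counts_match(gff3, fasta)
print(result)
```

Equal counts say nothing about whether the identifiers themselves agree, which is why this check alone does not catch the mismatch described above.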

Along these lines, one possible solution would be to extend the validation process to analyze the available ~annot.gff3, ~annot.mrna.fa, and ~annot.pep.fa files, specifically to parse the available transcript / translation ID types from each (as name:value pairs). Examples of these could then be displayed to the user on the configuration page, allowing the user to select the correct ID type by clicking the appropriate radio button.
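
As a rough Python sketch of this idea (not existing xGDBvm code), the validation step could collect every attribute tag found on transcript features, check how many of its values occur as FASTA identifiers, and return one example per tag for display. It assumes the FASTA identifier is the first whitespace-delimited token after '>' and that transcript features have type mRNA or transcript; all function names and sample data are hypothetical.

```python
from collections import defaultdict


def fasta_ids(fasta_lines):
    """Collect the first token of each FASTA header line."""
    ids = set()
    for line in fasta_lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:
                ids.add(tokens[0])
    return ids


def candidate_ids(gff3_lines, types=("mRNA", "transcript")):
    """Map each attribute tag on transcript features to its set of values."""
    values = defaultdict(set)
    for line in gff3_lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] not in types:
            continue
        for field in cols[8].split(";"):
            if "=" in field:
                tag, value = field.split("=", 1)
                values[tag.strip()].add(value.strip())
    return values


def rank_id_types(gff3_lines, fasta_lines):
    """Return (tag, n_matched, n_values, example) tuples, best match first."""
    ids = fasta_ids(fasta_lines)
    ranked = []
    for tag, vals in candidate_ids(gff3_lines).items():
        ranked.append((tag, len(vals & ids), len(vals), sorted(vals)[0]))
    ranked.sort(key=lambda row: row[1], reverse=True)
    return ranked


# Made-up sample data: FASTA headers were written from the Name attribute
gff3 = [
    "chr1\tsrc\tmRNA\t1\t900\t.\t+\t.\tID=m1;Name=geneA.1;Parent=geneA\n",
    "chr1\tsrc\tmRNA\t1\t700\t.\t+\t.\tID=m2;Name=geneA.2;Parent=geneA\n",
]
fasta = [">geneA.1 hypothetical protein\n", "MAK\n", ">geneA.2\n", "MAR\n"]
ranking = rank_id_types(gff3, fasta)
for tag, matched, total, example in ranking:
    print(f"{tag}: {matched}/{total} values found in FASTA (e.g. {example})")
```

The ranked rows could populate the radio buttons on the configuration page, with the best-matching ID type preselected and the final choice left to the user.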

Other solutions could also be explored.

jduvick commented Mar 2, 2016

I should point out that the screenshot in my previous note shows a CpGAT annotation record, but there is no issue with CpGAT annotations, as those GFF3 and FASTA files have matching IDs by design. The issue is with user-provided files from other genome annotation projects.
