Background
In xGDBvm's annotation workflow, user-provided ~annot.gff3 file(s) are combined, parsed and loaded as 'Precomputed Gene Models', each identified by a unique geneId, for display in the genome browser as an annotation track. xGDBvm also expects to load two associated sequence files derived from the GFF3 file data: ~annot.mrna.fa (transcripts) and ~annot.pep.fa (translations). These files allow xGDBvm users to download or query (via BLAST) annotation sequences in batches or one sequence at a time. For example, clicking on a gene model in the 'Genome Context Mode' of the xGDBvm genome browser brings up a 'Sequence Record' (via getRecord.pl), which displays summary information about that gene model, mostly from the parsed GFF3 table, but also, if available, a CDS translation from the indexed FASTA file (see screenshot) and a link to download the sequence (using returnFASTA.pl).
This functionality in getRecord.pl depends on a unique database value (geneId or transcript_id) in the GFF3-parsed table gseg_gene_annotation or cpgat_gene_annotation that matches a FASTA identifier in the associated sequence file. The FASTA files are found under /xGDBvm/data/GDBnnn/data/BLAST/, and the requisite queries, paths, and hypertext additions are set by the DSO module SequenceTrack.pm.
The issue
Unfortunately, there is no guarantee that the geneId and/or transcript_id parsed from GFF3 will match the FASTA identifiers provided (although xGDBvm instructions caution users to make sure there is a match). Mismatches are common because a GFF3 table may contain more than one unique identifier, displayed variously as e.g. 'ID=', 'geneID=', 'Name=', 'transcript_id=', etc. If more than one identifier is present, the parsing script is programmed to choose from among them hierarchically, and it can't know which one is appropriate for matching a FASTA record.
So, short of requiring users to munge their data ahead of time to ensure an ID match, we need some way to increase the probability that user-uploaded precomputed annotations will include the ID match described above.
Possible solutions
xGDBvm already provides a sequence validation script (from validate_files.php and xGDB_ValidateFiles.sh) and encourages users to run it before initiating their annotation workflow. It includes a rudimentary QC step that compares the number of 'transcript' records in the GFF3 file vs the number of associated FASTA records, and sets a warning flag if the two are not equal.
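As an illustration of why this count-based check is only rudimentary, here is a minimal Python sketch of the same idea. It is not the code in xGDB_ValidateFiles.sh; the function names and the assumption that 'transcript' records are GFF3 rows of type mRNA or transcript are hypothetical.

```python
def count_transcripts(gff3_lines, feature_types=("mRNA", "transcript")):
    """Count GFF3 rows whose type column (column 3) is a transcript-like feature.

    Which feature types count as 'transcript' is an assumption of this sketch.
    """
    count = 0
    for line in gff3_lines:
        if line.startswith("#"):
            continue  # skip directives and comments
        cols = line.split("\t")
        if len(cols) >= 3 and cols[2] in feature_types:
            count += 1
    return count


def count_fasta_records(fasta_lines):
    """Count FASTA records by counting header lines."""
    return sum(1 for line in fasta_lines if line.startswith(">"))


def counts_mismatch(gff3_lines, fasta_lines):
    """Return True (warning flag) when the two counts differ."""
    return count_transcripts(gff3_lines) != count_fasta_records(fasta_lines)


# Hypothetical data: the counts agree, yet no FASTA ID equals any 'ID=' value.
gff3 = [
    "##gff-version 3",
    "chr1\tsrc\tgene\t1\t100\t.\t+\t.\tID=g1;Name=GeneA",
    "chr1\tsrc\tmRNA\t1\t100\t.\t+\t.\tID=g1.t1;Name=GeneA.1;Parent=g1",
    "chr1\tsrc\tmRNA\t1\t90\t.\t+\t.\tID=g1.t2;Name=GeneA.2;Parent=g1",
]
fasta = [">GeneA.1 some description", "ATGAAA", ">GeneA.2", "ATGCCC"]
warning = counts_mismatch(gff3, fasta)
```

In this example no warning is raised, even though the FASTA identifiers correspond to 'Name=' rather than 'ID=', so equal counts say nothing about whether the identifiers actually match.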
One possible solution would be to extend this validation process to analyze the available ~annot.gff3, ~annot.mrna.fa and ~annot.pep.fa files, specifically to parse the transcript / translation ID types present in each (as name:value pairs). Examples of each ID type could then be displayed on the configuration page, where the user would select the correct one by clicking the appropriate radio button.
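A rough Python sketch of how such an analysis might work is shown below. It is not existing xGDBvm code: the function names are hypothetical, and it assumes that the FASTA identifier is the first whitespace-delimited token after '>' and that transcript rows have type mRNA or transcript. Each GFF3 attribute key is scored by how many of its values also occur as FASTA identifiers.

```python
from collections import defaultdict


def fasta_ids(fasta_lines):
    """Collect FASTA identifiers (first token after '>'; an assumption)."""
    ids = set()
    for line in fasta_lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:
                ids.add(tokens[0])
    return ids


def gff3_attribute_values(gff3_lines, feature_types=("mRNA", "transcript")):
    """Map each attribute key (ID, Name, ...) to the set of values seen
    on transcript-like rows, parsed from the GFF3 attributes column."""
    values = defaultdict(set)
    for line in gff3_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] not in feature_types:
            continue
        for pair in cols[8].split(";"):
            if "=" in pair:
                key, value = pair.split("=", 1)
                values[key.strip()].add(value.strip())
    return values


def rank_id_types(gff3_lines, fasta_lines):
    """Return (key, matched, total) tuples, best-matching key first."""
    ids = fasta_ids(fasta_lines)
    scores = [
        (key, len(vals & ids), len(vals))
        for key, vals in gff3_attribute_values(gff3_lines).items()
    ]
    return sorted(scores, key=lambda s: (-s[1], s[0]))


# Hypothetical input in which the FASTA headers use the 'Name=' values.
gff3 = [
    "chr1\tsrc\tgene\t1\t100\t.\t+\t.\tID=g1;Name=GeneA",
    "chr1\tsrc\tmRNA\t1\t100\t.\t+\t.\tID=g1.t1;Name=GeneA.1;Parent=g1",
    "chr1\tsrc\tmRNA\t1\t90\t.\t+\t.\tID=g1.t2;Name=GeneA.2;Parent=g1",
]
fasta = [">GeneA.1 some description", "ATGAAA", ">GeneA.2", "ATGCCC"]
ranking = rank_id_types(gff3, fasta)
```

A ranking like this could be used to pre-select the most likely radio button while still leaving the final choice to the user.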
Other solutions could also be explored.
I should point out that the screenshot in my previous note is a CpGAT annotation record, but there is no issue with CpGAT annotations, as those GFF3 and FASTA files have matching IDs by design. The issue is with user-provided files from other genome annotation projects.