Planning #1

Open
hexylena opened this issue Oct 19, 2022 · 5 comments

Comments

@hexylena
Member

hexylena commented Oct 19, 2022

Given previous conversations and statements like:

btw CDS vs exon came up in my discussion with Ian Korf: genestats assumes mRNA -> exon, but it would be nice to be able to do stats of mRNA -> CDS... putting exon entries in that correspond with the CDSes seems wrong to me because CDS can be part of an exon. I'm a bit undecided but this might be a TODO item for genestats (which could do with a version update anyway)
so what I have learned thus far: mRNA -> CDS messes up the jcvi annotation stats tool. Arguably when aligning protein -> DNA all you have is CDS - i.e. mRNA -> exon isn't strictly correct. But annotation stats expects exons.
I'm currently running a test sample through Maker2 to try and figure out what might be causing the Train SNAP tool to come out with a bad HMM... the thing here being that Maker2 annotation works as input (at least in the Eukaryotic genome annotation tutorial) but my current annotation doesn't.
what I've got at the moment is gene -> mRNA -> cds
Maker produces gene -> mRNA -> exon | CDS | five_prime_UTR | three_prime_UTR
yeah I've always seen CDS in uppercase, I guess it's the "standard" (if there's any in the gff world)

It would be interesting for us, the GGA group, to collectively do a survey of what real world GFF3 files look like.

Plan

  • Obtain GFF3s
    • (crowd?) source GFF3 files
    • Take random subsets of existing databases we know about (e.g. NCBI, FlyBase, etc.)
    • Can we, like, Google-search for GFF3 files and download a random selection of those?
  • Load into Galaxy
  • Analyse
    • Does anyone use SO terms for feature type in real life? Or is that just theoretical?
    • Does everyone capitalise CDS?
    • What trees of features are seen in the real world? Is Maker more correct, or are other tools producing different things?
  • Share results
    • Maybe write a short paper on what we see in practice, compare with what major tools are producing (don't want to survey every tool)
    • Submit an entry to naught-binfie-files / produce our own worst gff3 ever
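
The "Analyse" questions above could be answered with a small tally script. The following is only a rough sketch, not an agreed-upon tool: the function names (`parse_attributes`, `survey_gff3`) and the sample records are made up for illustration, and it assumes well-formed nine-column, tab-separated GFF3 with `ID=` and `Parent=` attributes in column 9. It counts how often each feature type appears and which parent type -> child type pairs occur.

```python
from collections import Counter


def parse_attributes(field):
    """Split a GFF3 column-9 string into a dict of key -> value."""
    attrs = {}
    for part in field.strip().split(";"):
        if "=" in part:
            key, value = part.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def survey_gff3(lines):
    """Return (feature type counts, (parent type, child type) counts)."""
    type_counts = Counter()
    id_to_type = {}
    children = []  # (child type, parent ID) pairs, resolved after reading everything
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # embedded sequence follows; no more features
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9:
            continue  # skip malformed rows in this sketch
        ftype = cols[2]
        attrs = parse_attributes(cols[8])
        type_counts[ftype] += 1
        if "ID" in attrs:
            id_to_type[attrs["ID"]] = ftype
        for parent in attrs.get("Parent", "").split(","):
            if parent:
                children.append((ftype, parent))
    edges = Counter()
    for ftype, parent in children:
        edges[(id_to_type.get(parent, "<unknown>"), ftype)] += 1
    return type_counts, edges


# Hypothetical records, including one lowercase "cds" to show the capitalisation check.
SAMPLE = [
    "##gff-version 3",
    "chr1\tdemo\tgene\t1\t900\t.\t+\t.\tID=g1",
    "chr1\tdemo\tmRNA\t1\t900\t.\t+\t.\tID=m1;Parent=g1",
    "chr1\tdemo\texon\t1\t400\t.\t+\t.\tID=e1;Parent=m1",
    "chr1\tdemo\tCDS\t50\t400\t.\t+\t0\tID=c1;Parent=m1",
    "chr1\tdemo\tcds\t500\t900\t.\t+\t0\tID=c2;Parent=m1",
]

if __name__ == "__main__":
    counts, edges = survey_gff3(SAMPLE)
    print("CDS spellings:", sorted(t for t in counts if t.lower() == "cds"))
    for (parent_type, child_type), n in sorted(edges.items()):
        print(f"{parent_type} -> {child_type}: {n}")
```

Resolving parents in a second pass means a child listed before its parent still gets counted correctly. Checking types against SO terms would need an actual copy of the ontology, which this sketch does not attempt.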
@abretaud
Member

I like this idea!
Maybe there's some overlap with https://github.com/NAL-i5K/AgBioData_GFF3_recommendation? Never really had the time to explore what they did

@hexylena
Member Author

Yeah great one!
I'll start adding URLs to stuff into this repo, and whenever we get around to it we'll just pull that entire list for analysis in galaxy.

@hexylena
Member Author

Added every GFF3 file I could find on my laptop (sanitized for owner= email addresses that were in apollo.)
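
For anyone else contributing files, the sanitising step could look roughly like the sketch below. It is an assumption-laden illustration, not the script that was actually used: it assumes the address sits in an `owner=` key in column 9, and the helper name `strip_owner` and the example record are hypothetical. Other attributes that might carry addresses are not handled.

```python
def strip_owner(line, keys=("owner",)):
    """Drop the given attribute keys from column 9 of one GFF3 line."""
    # Comments, directives and anything that is not nine tab-separated
    # columns pass through unchanged.
    if line.startswith("#") or line.count("\t") != 8:
        return line
    newline = "\n" if line.endswith("\n") else ""
    cols = line.rstrip("\n").split("\t")
    kept = [
        part
        for part in cols[8].split(";")
        if part and part.split("=", 1)[0].strip() not in keys
    ]
    # GFF3 uses "." for an empty column.
    cols[8] = ";".join(kept) or "."
    return "\t".join(cols) + newline


if __name__ == "__main__":
    # Hypothetical record with a placeholder address.
    record = "chr1\tapollo\tgene\t1\t900\t.\t+\t.\towner=someone@example.org;ID=g1\n"
    print(strip_owner(record), end="")
```

Applied line by line, this leaves every other attribute and the column layout untouched, so the survey results should not be affected by the clean-up.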
