[Question] Can this be used broadly for all microeukaryotes (e.g., fungi, protists, etc)? #100
Comments
Hi
Do you want to align it to the BUSCO lineages? This could be done by a simple lookup. |
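A minimal sketch of what such a lookup could look like, assuming Helixer's four lineage models (fungi, land_plant, vertebrate, invertebrate) and BUSCO dataset names of the form clade_odb10. The mapping table and the function name helixer_lineage_for are hypothetical and only for illustration; the pairings would need to be checked against both tools.

```python
from typing import Optional

# Hypothetical mapping from BUSCO clade names to Helixer --lineage values.
# These pairings are assumptions for illustration, not an official table.
BUSCO_TO_HELIXER = {
    "fungi": "fungi",
    "embryophyta": "land_plant",
    "vertebrata": "vertebrate",
    "arthropoda": "invertebrate",
}


def helixer_lineage_for(busco_lineage: str) -> Optional[str]:
    """Return a Helixer lineage for a BUSCO lineage name, or None if unmapped."""
    # Strip the database suffix, e.g. "fungi_odb10" -> "fungi".
    clade = busco_lineage.lower().split("_odb")[0]
    return BUSCO_TO_HELIXER.get(clade)


print(helixer_lineage_for("fungi_odb10"))
print(helixer_lineage_for("alveolata_odb10"))
```

With a table like this, protist clades such as alveolata have no entry and come back as None, which is the gap this issue is asking about.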
I meant: if Helixer had a model for each eukaryotic BUSCO lineage, it could auto-detect in the backend which model to use. I guess I'm asking whether there are plans to expand the set of available models. |
Hi Jolespin, Thanks for your interest! This is a cool idea on auto-detection. One would have to test it, but I'd hazard a guess that auto detection might work off of the confidence of Helixer's raw predictions. That said, I don't currently have much capacity for larger schemes, so in summary
That said on the no's, if anyone wants to see this enough to give it a shot, I'm happy to do what I can at a high level; drop a line. |
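For anyone who wants to try the confidence idea, here is a rough sketch, not tested against real Helixer output. It assumes each candidate model has already been run on the same sequence and that its raw predictions are available as rows of per-base class probabilities; the data layout and the names mean_confidence and pick_model are hypothetical.

```python
from statistics import mean


def mean_confidence(probs):
    """Average of the highest class probability at each base."""
    return mean(max(row) for row in probs)


def pick_model(predictions):
    """Pick the model whose raw predictions are most confident on average.

    predictions: dict mapping model name -> list of per-base
    class-probability rows (an assumed layout, for illustration only).
    """
    scores = {name: mean_confidence(p) for name, p in predictions.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Toy input: two bases, four classes per base, two candidate models.
toy = {
    "fungi": [[0.9, 0.05, 0.03, 0.02], [0.8, 0.1, 0.05, 0.05]],
    "land_plant": [[0.4, 0.3, 0.2, 0.1], [0.5, 0.2, 0.2, 0.1]],
}
best, scores = pick_model(toy)
print(best, scores)
```

Whether average confidence actually tracks the best-fitting lineage would have to be checked empirically; a model can be confidently wrong, for example by calling everything intergenic.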
Understood the bandwidth issue. I'm in the same boat right now w/ my VEBA package. I would be using this for microeukaryotic organisms (e.g., protists). Do you have any instructions on training a custom model? If so, what is required for this? Genomic and CDS sequences or could this be done directly from protein sequences? |
Yes, I'm currently working on the latest instructions here: https://github.com/weberlab-hhu/Helixer/blob/cleanup/docs/training.md You will need fasta and gff3 files for the training species (it's supervised training). I am not on top of the availability of Drop questions whenever! |
For the training data, are genes with alternative start codons used or discarded? |
In the current default implementation, genes with non-ATG start codons are used, but the upstream region is masked. So the network will learn there's a gene there, but not receive feedback on where exactly it started. This behavior is of course not designed for alternative start codons, but is a side effect of the partial-gene-model detection and masking, which assumes the standard genetic code. In general, supporting genetic code variants is on the "think about it" list; not doing so yet has been a rare--but still noticeable--issue in fungi already. |
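To illustrate the masking behavior described above, here is a simplified sketch. It is not Helixer's actual code: it handles only a forward-strand gene, and the function name start_codon_mask and the fixed upstream window are assumptions made for the example.

```python
def start_codon_mask(seq, cds_start, upstream=5):
    """Per-base training weights (1 = used, 0 = masked) for a forward-strand gene.

    cds_start is the 0-based index of the first CDS base. `upstream` is a
    hypothetical window size chosen only for this illustration.
    """
    weights = [1] * len(seq)
    if seq[cds_start:cds_start + 3].upper() != "ATG":
        # Non-ATG start: treat the gene model as possibly partial and mask
        # the bases immediately upstream of the annotated start.
        for i in range(max(0, cds_start - upstream), cds_start):
            weights[i] = 0
    return weights


# ATG start: nothing is masked.
print(start_codon_mask("CCCCCCATGAAATAA", 6))
# TTG start: the five bases before the start get weight 0.
print(start_codon_mask("CCCCCCTTGAAATAA", 6))
```

The gene body keeps its weight in both cases, so the network still gets feedback that a gene is present, just not on where it begins.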
Any updates on whether or not this should be able to handle protists soon? I've compiled this huge protein set for protists and fungi: https://zenodo.org/records/10139451 |
I'm thinking about trying this out for a backend for my VEBA eukaryotic binning module (https://github.com/jolespin/veba) as an alternative to MetaEuk.
Looking at the examples here:
I won't have the lineage or species at this point in the pipeline.
--lineage parameter in the gene calling process?
--species be anything? For example, can I give it a bin ID such as S1__METABAT2__bin.1?