[Question] Can this be used broadly for all microeukaryotes (e.g., fungi, protists, etc)? #100

Open
jolespin opened this issue Jun 3, 2023 · 8 comments


@jolespin

jolespin commented Jun 3, 2023

I'm thinking about trying this out for a backend for my VEBA eukaryotic binning module (https://github.com/jolespin/veba) as an alternative to MetaEuk.

Looking at the examples here:

Helixer.py --lineage land_plant --fasta-path Arabidopsis_lyrata.v.1.0.dna.chromosome.8.fa  \
  --species Arabidopsis_lyrata --gff-output-path Arabidopsis_lyrata_chromosome8_helixer.gff3

I won't have the lineage or species at this point in the pipeline.

  • How important is the --lineage parameter in the gene calling process?
  • Can --species be anything? For example, can I give it a bin ID such as S1__METABAT2__bin.1?
  • Will this be available on bioconda any time soon?
  • Are there any plans to implement this with BUSCO lineages to do autodetection of lineage and then use models based on BUSCO lineages?
@BjoernUsadel
Contributor

Hi
The lineage selects the trained model, which is currently one of:

  • land plants
  • invertebrates
  • vertebrates
  • fungi
Hence you need this, or alternatively a model file path set with "--model-filepath".

Do you want to align it to the BUSCO lineages? This could be done by a simple lookup.
However, because the gene parameters used for gene detection change depending on these broad lineages, the lineage is needed.
Cheers
b
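The "simple lookup" mentioned above could be sketched roughly as follows. This is only an illustration: the BUSCO dataset names chosen, the pairing of each with a Helixer lineage, and the exact `--lineage` spellings other than `land_plant` (the only one shown in this thread) are assumptions, and `helixer_lineage` is a hypothetical helper, not part of Helixer.

```python
# Hypothetical lookup from BUSCO lineage dataset names to Helixer --lineage values.
# Only "land_plant" appears in this thread; the other spellings are assumptions.
BUSCO_TO_HELIXER = {
    "embryophyta": "land_plant",
    "fungi": "fungi",
    "vertebrata": "vertebrate",
    "arthropoda": "invertebrate",
}


def helixer_lineage(busco_lineage):
    """Return the mapped Helixer lineage, or None if the BUSCO lineage is unmapped."""
    # Drop a trailing "_odb10"-style suffix so "fungi_odb10" and "fungi" both match.
    key = busco_lineage.lower().split("_odb")[0]
    return BUSCO_TO_HELIXER.get(key)
```

Lineages with no trained model (for example most protist groups) would return `None` here, so the caller has to decide whether to skip the bin or fall back to another gene caller.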

@jolespin
Author

jolespin commented Jul 5, 2023

I meant that if Helixer had a model for each eukaryotic BUSCO lineage, it could auto-detect the lineage in the backend to know which model to use.

I guess I'm asking if there are plans to build out the models available.

@alisandra
Collaborator

Hi Jolespin,

Thanks for your interest!

This is a cool idea on auto-detection. One would have to test it, but I'd hazard a guess that auto detection might work off of the confidence of Helixer's raw predictions.
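As a rough sketch of what confidence-based auto-detection might look like (untested, as noted above): run each available model on the same sequence, then keep the lineage whose predictions are most confident on average. How the raw per-base probabilities are obtained from Helixer is not covered here; the input format and both function names are assumptions made for illustration.

```python
from statistics import mean


def mean_confidence(probs):
    """Average of the highest class probability at each base.

    probs: list of per-base class-probability lists (assumed input format).
    """
    return mean(max(p) for p in probs)


def pick_lineage(predictions):
    """Return the lineage whose model is most confident on average.

    predictions: dict mapping lineage name -> per-base class probabilities.
    """
    return max(predictions, key=lambda name: mean_confidence(predictions[name]))
```

Whether mean confidence actually tracks model fit would need to be checked against genomes of known lineage before relying on it.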

That said, I don't currently have much capacity for larger schemes, so in summary

  • yes, --lineage is critical for current models; otherwise a poorly fitting model would be used and prediction quality would suffer massively.
  • --species should be any reasonable-length string without whitespace. It's going to end up in the gff3 file's gene names. A prefix such as 'At' or 'A_thal' for Arabidopsis thaliana would be reasonable. The bin ID should work without issue.
  • bioconda: not in the immediate future
  • BUSCO-based lineage auto-detection: not in the immediate future

That said, on the no's: if anyone wants to see this enough to give it a shot, I'm happy to do what I can at a high level; drop a line.
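For pipeline use, a minimal sketch of turning a bin ID into a `--species` value and assembling the call with the flags from the example at the top of this thread. The helper names and the 50-character cap are assumptions for illustration; "reasonable length" is not defined more precisely here.

```python
import re


def species_label(bin_id, max_len=50):
    """Make a whitespace-free label from a bin ID (max_len is an arbitrary choice)."""
    label = re.sub(r"\s+", "_", bin_id.strip())
    if not label:
        raise ValueError("bin ID is empty")
    return label[:max_len]


def helixer_command(fasta_path, bin_id, lineage, gff_path):
    """Build the argument list; only flags shown in this thread are used."""
    return [
        "Helixer.py",
        "--lineage", lineage,
        "--fasta-path", fasta_path,
        "--species", species_label(bin_id),
        "--gff-output-path", gff_path,
    ]
```

The returned list can be passed to `subprocess.run` as-is, which avoids shell quoting problems with unusual bin IDs.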

@jolespin
Author

Understood the bandwidth issue. I'm in the same boat right now w/ my VEBA package.

I would be using this for microeukaryotic organisms (e.g., protists). Do you have any instructions on training a custom model? If so, what is required for this? Genomic and CDS sequences or could this be done directly from protein sequences?

@alisandra
Collaborator

Yes, I'm currently working on the latest instructions here: https://github.com/weberlab-hhu/Helixer/blob/cleanup/docs/training.md
(I will merge them to main soon).

You will need fasta and gff3 files for the training species (it's supervised training). I am not on top of the availability of good references in protists; I can imagine it might be a challenge.
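Before assembling a training set, it may help to check that each gff3 actually refers to sequences present in its fasta. The sketch below is a generic sanity check, not part of Helixer's training workflow; the function names are hypothetical and the inputs are simply iterables of text lines.

```python
def fasta_ids(lines):
    """Collect sequence IDs (first token after '>') from FASTA header lines."""
    return {
        line[1:].split()[0]
        for line in lines
        if line.startswith(">") and line[1:].strip()
    }


def gff3_seqids(lines):
    """Collect column-1 sequence IDs from GFF3 feature lines."""
    ids = set()
    for line in lines:
        if line.startswith("##FASTA"):
            break  # an embedded FASTA section follows; no more features
        if line.startswith("#") or not line.strip():
            continue  # skip comments, directives and blank lines
        ids.add(line.split("\t")[0])
    return ids


def missing_seqids(fasta_lines, gff_lines):
    """Sequence IDs annotated in the GFF3 but absent from the FASTA."""
    return gff3_seqids(gff_lines) - fasta_ids(fasta_lines)
```

An empty result means every annotated sequence has a matching fasta record; anything else points at mismatched files or renamed contigs.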

Drop questions whenever!

@jolespin
Author

For the training data, are genes with alternative start codons used or discarded?

@alisandra
Collaborator

In the current default implementation, genes with non-ATG start codons are used, but the upstream region is masked. So the network will learn there's a gene there, but not receive feedback on where exactly it started.

This behavior is of course not designed for alternative start codons, but is a side effect of the partial-gene-model detection and masking, which assumes the standard genetic code.

In general, supporting genetic code variants is on the "think about it" list; not doing so yet has been a rare--but still noticeable--issue in fungi already.
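To make the masking idea concrete, here is a toy illustration, not Helixer's actual code. It assumes a forward-strand gene, a per-base weight of 1 (scored) or 0 (masked), and an arbitrary upstream window size; `loss_mask` and its parameters are hypothetical.

```python
def loss_mask(sequence, cds_start, upstream=3):
    """Per-base training weights for a forward-strand gene (toy example).

    If the annotated CDS does not begin with ATG, the `upstream` bases just
    before cds_start get weight 0, so the exact start position contributes
    nothing to training while the gene body is still scored.
    """
    mask = [1] * len(sequence)
    start_codon = sequence[cds_start:cds_start + 3].upper()
    if start_codon != "ATG":
        for i in range(max(0, cds_start - upstream), cds_start):
            mask[i] = 0
    return mask
```

With a standard ATG start the mask stays all ones; with an alternative start such as TTG the bases immediately upstream are zeroed, which mirrors the "learns there's a gene, but not where it started" behavior described above.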

@jolespin
Author

jolespin commented Jun 20, 2024

Any updates on whether or not this should be able to handle protists soon?

I've compiled this huge protein set for protists and fungi: https://zenodo.org/records/10139451
Though, the genomes aren't available in this dataset unfortunately.


3 participants