Replies: 18 comments
-
Dear Jackson, Thank you for highlighting this error. I will probably remove the option. In the future, I will probably prefer annotating the genomes with DRAM (#360).
-
@SilasK That is a nice idea to try DRAM. I have never used it, but it is neat that it appears to have many levels of annotation. Note that it is pretty heavy on disk usage and RAM, however (see https://github.com/shafferm/DRAM#system-requirements):
It sounds like annotation using UniRef90 will take about 220 GB of RAM, which is roughly equivalent to the current GTDB classifier. This might not be a huge problem, but it does mean that DRAM will add one more RAM bottleneck to the pipeline.
-
That allows you to relate the gene annotations to the genome. But you are interested in the exact gene - MAG ORF link? You should be able to run DRAM without the UniRef annotation and focus on KEGG.
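For what it's worth, here is a minimal sketch (wrapped in Python's `subprocess` for consistency with the rest of this thread) of running DRAM at the KEGG/KOfam level only. The `DRAM.py annotate` and `DRAM.py distill` subcommands are taken from DRAM's README, but the assumption that UniRef90 is opt-in (and therefore skipped by simply not requesting it) and all paths should be double-checked against your installed version:

```python
import subprocess

# Hypothetical paths; adjust to the actual ATLAS output layout.
bins_glob = "genomes/genomes/*.fasta"
out_dir = "DRAM_annotation"

# Annotate the MAGs. UniRef90 is assumed to be opt-in in DRAM, so not
# requesting it keeps the run at the lighter KEGG/KOfam level.
subprocess.run(
    ["DRAM.py", "annotate", "-i", bins_glob, "-o", out_dir],
    check=True,
)

# Summarise the per-gene annotations into genome-level metabolism tables.
subprocess.run(
    ["DRAM.py", "distill",
     "-i", f"{out_dir}/annotations.tsv",
     "-o", f"{out_dir}/distilled"],
    check=True,
)
```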
-
@SilasK Thanks for the advice. Correct -- if relying on KOfam, then DRAM will use <50 GB of RAM. That sounds like a reasonable option if RAM is an issue.
-
@SilasK I took a look at
Ideally, I would like to use the Genecatalog data to annotate the MAGs. Because of this, it would be very helpful to map the Gene ID to the specific ORF ID of the MAG. Mapping the Gene ID to the ORF ID of the MAG seems kind of tricky, though, because the Gene IDs are derived from the unbinned contigs, whereas the ORF IDs come from the MAGs, which are annotated after genome binning. I wonder if the easiest solution might be to cluster the MAG ORFs with the Genecatalog.
-
@jmtsuji My understanding is that the gene catalogue already contains genes clustered at 95% identity, so each gene in the gene catalogue is representative of a cluster of multiple genes from either the contigs or the genomes, respectively.
For contigs: contigs --> predict genes --> cluster genes at 95% identity --> representative genes --> gene catalogue
For genomes: genomes --> predict genes --> cluster genes at 95% identity --> representative genes --> gene catalogue
I know you can get counts of the number of genes in each cluster, but I'm not sure if there is a mapping file between each gene in the contig/genome and the cluster it belongs to.
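If such a mapping file does exist (the `orf2gene.tsv.gz` table mentioned later in this thread looks like it), a quick pandas sketch shows how cluster membership could be inspected; the path and the example gene ID are hypothetical:

```python
import pandas as pd

# Hypothetical path: one row per predicted ORF with the 95%-identity
# cluster representative ("Gene") it was assigned to.
orf2gene = pd.read_table("Genecatalog/clustering/orf2gene.tsv.gz")  # columns: ORF, Gene

# Number of member ORFs behind each representative gene.
cluster_sizes = orf2gene.groupby("Gene").size().rename("n_members")

# All ORFs that collapsed onto one (hypothetical) representative of interest.
members = orf2gene.loc[orf2gene["Gene"] == "Gene0000001", "ORF"]
print(cluster_sizes.head(), members.head(), sep="\n")
```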
-
@SilasK It might be beneficial to assign each gene in the entire pipeline a universally unique identifier (https://en.wikipedia.org/wiki/Universally_unique_identifier), if you are not already doing so. That way, you could make a mapping file/database of which gene from which contig/genome belongs to which cluster and representative sequence.
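A minimal sketch of the UUID idea, assuming gene records pass through Python at some point in the pipeline; the ORF names used here are just placeholders:

```python
import uuid

# Toy example: attach a random (version 4) UUID to each predicted gene so it
# keeps a stable, globally unique key through renaming, clustering and binning.
orf_names = ["sample1_contig1_1", "sample1_contig1_2"]  # placeholder ORF names
gene_uuids = {orf: str(uuid.uuid4()) for orf in orf_names}
print(gene_uuids)
```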
-
@jmtsuji This could also be used to map ORF IDs.
-
@SilasK Within the `gene2genome.tsv.gz` file, is the Gene column referring to the gene catalogue ID, or to some gene ID that is unique to that file?
gene2genome.tsv.gz: Gene, MAG, Ncopies
orf2gene.tsv.gz: ORF, Gene
-
I think we need some file like the following:
The file headers above are completely unnormalized, so it would be a very big file. But the underlying data could be represented pretty easily in a normalized SQLite3 database file.
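As a rough illustration of what the normalized SQLite3 alternative could look like (table and column names here are invented for the sketch, not existing ATLAS outputs):

```python
import sqlite3

schema = """
CREATE TABLE IF NOT EXISTS orfs (
    orf_id   TEXT PRIMARY KEY,   -- e.g. <sample>_<contig>_<gene number>
    contig   TEXT NOT NULL,
    mag      TEXT                -- NULL if the contig was not binned
);
CREATE TABLE IF NOT EXISTS genes (
    gene_id  TEXT PRIMARY KEY    -- Genecatalog representative
);
CREATE TABLE IF NOT EXISTS orf2gene (
    orf_id   TEXT REFERENCES orfs(orf_id),
    gene_id  TEXT REFERENCES genes(gene_id),
    PRIMARY KEY (orf_id, gene_id)
);
"""

with sqlite3.connect("genecatalog.sqlite") as con:
    con.executescript(schema)
```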
-
Ok guys. What I have in atlas is: all predicted genes have a unique name by the fact that they are based on the contig names, which are themselves unique.
Now, in the genome workflow, contigs are binned and dereplicated, and the dereplicated MAGs are consistently renamed.
@jmtsuji You would like to know which ORF of the MAG corresponds to a gene? Can you explain why you want this information? You could turn off the renaming of MAG contigs via the config file.
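Putting the two mapping files together is then a simple pandas merge; the paths below are illustrative, and the column names (ORF, Gene, MAG, Ncopies) follow the headers quoted earlier in the thread:

```python
import pandas as pd

# Illustrative paths; column names follow the headers quoted above.
orf2gene = pd.read_table("Genecatalog/clustering/orf2gene.tsv.gz")        # ORF, Gene
gene2genome = pd.read_table("Genecatalog/clustering/gene2genome.tsv.gz")  # Gene, MAG, Ncopies

# One row per (ORF, MAG) combination: which catalogue gene an ORF was
# clustered into, and which dereplicated genome(s) carry that gene.
orf2mag = orf2gene.merge(gene2genome, on="Gene", how="left")
print(orf2mag.head())
```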
-
@LeeBergstrand So the mapping tables are quite large, and most of the information is actually encoded in the names of the ORFs. The SQL database would consist mainly of the two mapping files mentioned above. I don't have much experience with MySQL, and I have the impression that it makes the files less accessible to the average user. Also, have a look at how I do the mapping:
atlas/atlas/rules/genecatalog.smk
Lines 529 to 577 in fbed90e
-
In the future, I plan to make the gene catalog a little more independent from the genomes workflow, e.g. using PLASS to assemble genes directly from the raw reads for data where the assembly is not working optimally. It will therefore become necessary to map the genes predicted from the MAGs to the Genecatalog. I think this could be done with mmseqs but without re-clustering; there should be an option for simply mapping at ~100% identity.
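A possible sketch of that mapping step, calling MMseqs2 from Python; `easy-search` with a high `--min-seq-id` is used here as a stand-in for a dedicated "map without re-clustering" mode, and the input paths are hypothetical:

```python
import subprocess

# Hypothetical inputs: protein ORFs predicted from the MAGs and the existing
# Genecatalog representatives.
query = "genomes/annotations/orf_predictions.faa"
target = "Genecatalog/gene_catalog.faa"

# easy-search with a strict identity cutoff approximates "mapping without
# re-clustering"; the default .m8 output is BLAST-tab style (query, target,
# identity, ..., e-value, bit score).
subprocess.run(
    ["mmseqs", "easy-search", query, target, "orf2catalog.m8", "tmp",
     "--min-seq-id", "0.95"],
    check=True,
)
```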
-
@SilasK Thanks for the helpful information! It is good to know that you are hoping to make the Genecatalog more independent of the genomes workflow long-term. (Trying out PLASS would also be interesting.) In my case, it's very helpful to relate the Genecatalog entries to the predicted ORFs of the genomes. Here are a couple of workflows I commonly use:
I agree that the easiest solution to relate the Genecatalog entries to the predicted ORFs in the genomes might be to map the ORFs predicted in the MAGs to the Genecatalog (e.g., using MMSeqs2). I recently tested out the "Genecatalog - genomes" mapping method I described above on my ATLAS run (via MMSeqs2), and it gives me data I am satisfied with.
@LeeBergstrand Good to know that you are also interested in this topic! Like Silas mentioned, I think a lot of the info you mentioned in your table is already encoded in the FastA headers and/or in the existing mapping files.
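For completeness, here is a hedged sketch of how catalogue-level annotations could then be transferred onto the MAG ORFs, assuming the `.m8` hits file from the MMseqs2 mapping above and a hypothetical annotation table keyed by Genecatalog gene:

```python
import pandas as pd

# Best hit per MAG ORF from the mmseqs mapping above (BLAST-tab columns,
# named manually because .m8 files have no header line).
cols = ["orf", "gene", "pident", "alnlen", "mismatch", "gapopen",
        "qstart", "qend", "tstart", "tend", "evalue", "bits"]
hits = pd.read_table("orf2catalog.m8", header=None, names=cols)
best = hits.sort_values("bits", ascending=False).drop_duplicates("orf")

# Hypothetical annotation table keyed by Genecatalog gene
# (e.g. an eggNOG or KEGG annotation export).
annotations = pd.read_table("Genecatalog/annotations.tsv", index_col="Gene")

# Transfer catalogue-level annotations onto the individual MAG ORFs.
orf_annotations = best.join(annotations, on="gene")
print(orf_annotations.head())
```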
-
@jmtsuji I see that if you want to look at the neighborhoods of genes, you need to know the exact gene. However, keep in mind that the MAGs are only the best, representative genomes per species. I would suggest that you look for neighboring genes in all bins, or even on all contigs. In that case, you can look at the mapping files outlined above. It would be cool to create a db for all ORFs on all contigs; in this case, maybe an SQL database would be meaningful.
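As a toy illustration of that kind of neighborhood lookup (independent of the gist linked in the next comment), assuming Prodigal-style ORF identifiers of the form `<contig>_<gene number>` as described earlier in the thread:

```python
import pandas as pd

def neighbors(orf_id: str, orf_table: pd.DataFrame, window: int = 2) -> pd.DataFrame:
    """Return ORFs within +/- `window` gene positions on the same contig.

    Assumes Prodigal-style identifiers <contig>_<gene number> and an
    `orf_table` with one row per ORF and an "ORF" column.
    """
    contig, num = orf_id.rsplit("_", 1)
    wanted = {f"{contig}_{i}" for i in range(int(num) - window, int(num) + window + 1)}
    return orf_table[orf_table["ORF"].isin(wanted)]

# Example with a toy table:
orfs = pd.DataFrame({"ORF": [f"sample1_contig5_{i}" for i in range(1, 8)]})
print(neighbors("sample1_contig5_4", orfs))
```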
-
@jmtsuji The discussion with you motivated me to do some coding. Here is my example of how I would look at the neighborhood of a gene in all the contigs and bins. https://gist.github.com/SilasK/5932a6887ee4d2520b5a59cec06d09b7 What do you think? |
-
@SilasK I would use SQLite3 (a data file with an SQL library interface) rather than MySQL (a full-on database server) for this application. I have pretty extensive experience with SQL (I used it a lot in undergrad, at my company, and during my masters), so if you need my help with designing schemas etc., I can do that. Using an ORM like SQLAlchemy also helps, and SQLAlchemy can hook into pandas.
Yes, this is a concern. You would either have to have a command-line tool that creates mapping files as needed for end users, or you would have to have Python code for parsing the data found in the SQLite3 file. If we plan to go the SQL route, I'm willing to chip in some development time. I would also be interested in building this as a complementary feature to the mapping files. Here is an example of some SQLAlchemy code: https://github.com/Micromeda/pygenprop/blob/develop/pygenprop/assignment_database.py You basically create a series of objects, and then the Python code syncs their data to the database file.
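To make that concrete, here is a small sketch in the spirit of the pygenprop example, declaring a couple of tables with SQLAlchemy and showing the pandas hook; the schema and file names are illustrative only:

```python
import pandas as pd
from sqlalchemy import Column, ForeignKey, String, create_engine
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Gene(Base):
    __tablename__ = "genes"
    gene_id = Column(String, primary_key=True)  # Genecatalog representative

class Orf(Base):
    __tablename__ = "orfs"
    orf_id = Column(String, primary_key=True)   # <sample>_<contig>_<gene number>
    contig = Column(String, nullable=False)
    mag = Column(String)                        # NULL if the contig was not binned
    gene_id = Column(String, ForeignKey("genes.gene_id"))

engine = create_engine("sqlite:///genecatalog.sqlite")
Base.metadata.create_all(engine)

# pandas can write existing mapping tables straight into the same file.
orf2gene = pd.read_table("Genecatalog/clustering/orf2gene.tsv.gz")
orf2gene.to_sql("orf2gene_raw", engine, if_exists="replace", index=False)
```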
-
There has been no activity for some time. I hope your issue has been solved in the meantime. Thank you for your contributions.
-
Hello @SilasK,
I'm now testing out ATLAS 2.6a2, using a server running Ubuntu 20.04.2.
I've noticed that setting the genecatalog to `source: genomes` in the config file seems to be broken now. (The default `source: contigs` works.) I am now getting a MissingInputException that `genomes/annotations/orf2genome.tsv` is missing.
I am guessing this is coming from `localrules: concat_genes` in `genecatalog.smk` (see the line below that I marked with an asterisk):
Is this `orf2genome.tsv` file meant to be produced by this new version of ATLAS? I do not find mention of it in the other atlas rules. It also appears not to be required in the `run` command at the bottom of the rule. I wonder if this rule would still work without `genomes/annotations/orf2genome.tsv` in the input. (I will test this if I get the time.)
I am okay to use `source: contigs` for now, but I thought I would let you know about this issue.
Thanks,
Jackson