refseq_virus/NCBI taxonomy behind current VMR.39 on more than a year #86

Open
igortru opened this issue Sep 10, 2024 · 17 comments

@igortru

igortru commented Sep 10, 2024

I would suggest rebuilding the Metabuli viral reference database using the current official viral taxonomy provided at https://ictv.global/vmr

Technically, it is not difficult.
Your tool processed it very smoothly.
If you are interested, I can provide a URL where the new database can be downloaded.
You will be surprised how many virus names/lineages cannot be synchronized between these two major resources: thousands.

@jaebeom-kim
Member

Thank you for pointing this out and great to hear that Metabuli was working well.
I should check the resource right away.
Could you please provide the url to download the DB?
I want to test and reproduce it.
Does the resource provide NCBI-style .dmp files like nodes.dmp and names.dmp?

@igortru
Author

igortru commented Sep 10, 2024

Do you have access to AWS S3?
s3://serratus-public/igortu/metabuli.VMR.39/
or
https://serratus-public.s3.amazonaws.com/igortu/metabuli.VMR.39/

This directory contains all required files, including the dmp files and a nucleotide FASTA (>2 GB).
The whole VMR taxonomy now has new taxids that are incompatible with NCBI taxonomy.
Full synchronization between VMR and NCBI taxonomy IDs is impossible right now.

@jaebeom-kim
Member

Great! I'm downloading the directory.
I'd like to know how you generated the dmp files.
Does VMR provide them?

@jaebeom-kim
Member

Thank you for the detailed explanation!
I want to provide users with both: a viral DB with NCBI taxonomy and a viral DB with VMR taxonomy. There are two reasons why I wanted to know how you generated the dmp files.

  1. I want to be able to build a VMR-based DB myself so that I can provide an updated DB when a new VMR is released.
  2. I want to build a VMR-based DB using human and viral genomes.

Could you share your knowledge or scripts?

@igortru
Author

igortru commented Sep 11, 2024

As a first step, take the "species" column from VMR and try to map it onto the NCBI names.dmp file.

It is interesting how many species from VMR cannot be mapped to the current NCBI taxonomy.

I'll do it myself later today as well. I want to recreate the Metabuli reference database with accession2speciestaxid.
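
A minimal sketch of this first step, assuming the VMR species names are already available as a list and that names.dmp uses the usual `|`-separated NCBI layout (taxid, name, unique name, name class). The function names `load_scientific_names` and `map_species` are illustrative, not part of Metabuli or of any script mentioned in this thread.

```python
import io


def load_scientific_names(handle):
    """Return {name: taxid} for the 'scientific name' rows of a names.dmp stream."""
    name_to_taxid = {}
    for line in handle:
        # Each row looks like: taxid | name | unique name | name class |
        fields = [f.strip() for f in line.strip().rstrip("|").split("|")]
        if len(fields) >= 4 and fields[3] == "scientific name":
            name_to_taxid[fields[1]] = int(fields[0])
    return name_to_taxid


def map_species(species_names, name_to_taxid):
    """Split species names into those found in NCBI names and those that are not."""
    mapped, unmapped = {}, []
    for name in species_names:
        taxid = name_to_taxid.get(name)
        if taxid is None:
            unmapped.append(name)
        else:
            mapped[name] = taxid
    return mapped, unmapped


# Tiny in-memory stand-in for names.dmp; a real run would open the file instead.
names_dmp = io.StringIO(
    "2846657\t|\tAlphacrustrhavirus wenling\t|\t\t|\tscientific name\t|\n"
    "2734345\t|\tAlphaendornavirus agarici\t|\t\t|\tscientific name\t|\n"
)
ncbi_names = load_scientific_names(names_dmp)
mapped, unmapped = map_species(
    ["Alphacrustrhavirus wenling", "Alphaendornavirus basellae"], ncbi_names
)
print(f"{len(mapped)} mapped, {len(unmapped)} not found in names.dmp")
```

Matching only on exact scientific names is a deliberate simplification; synonyms and other name classes in names.dmp are ignored here.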

@igortru
Author

igortru commented Sep 11, 2024

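NCBI taxonomy is outdated at the species level as well.
I see many ICTV species that are simply not present in NCBI taxonomy (in the listing below, a species name followed by a names.dmp line was matched; a name with no such line was not found):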

Alphacrustrhavirus wenling
2846657 | Alphacrustrhavirus wenling | | scientific name |
Alphacrustrhavirus zhejiang
2846696 | Alphacrustrhavirus zhejiang | | scientific name |
Alphadintovirus mayetiola
2843674 | Alphadintovirus mayetiola | | scientific name |
Alphadrosrhavirus hubei
2844861 | Alphadrosrhavirus hubei | | scientific name |
Alphadrosrhavirus shayang
2846193 | Alphadrosrhavirus shayang | | scientific name |
Alphaendornavirus agarici
2734345 | Alphaendornavirus agarici | | scientific name |
Alphaendornavirus basellae
Alphaendornavirus capsici
Alphaendornavirus cucumis
Alphaendornavirus cyamopsis
Alphaendornavirus erysiphes
Alphaendornavirus fucapsici
Alphaendornavirus fuphaseoli
Alphaendornavirus helianthi

@igortru
Author

igortru commented Sep 11, 2024

s3://serratus-public/igortu/metabuli.VMR.39/vmr.gc - genetic codes for GenBank accessions
s3://serratus-public/igortu/metabuli.VMR.39/vmr.39.tsv - VMR.39 with typos in GenBank accessions fixed. To see my fixes, export the original xlsx to TSV and compare it with this version.

s3://serratus-public/igortu/metabuli.VMR.39/ReadTSV.py

You can modify the script as needed:
python ReadTSV.py vmr.39.tsv vmr.gc
produces all the dmp files.
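
ReadTSV.py itself is not reproduced in this thread, so the following is only a rough sketch of how NCBI-style dmp files could be generated from ranked lineages, not the actual script. It assumes each VMR row has already been reduced to a list of (rank, name) pairs ordered from the highest rank down to species, and it assigns fresh taxids (consistent with the earlier remark that the VMR taxids are new and incompatible with NCBI). Only the first three nodes.dmp columns are written; whether a given tool needs the remaining NCBI columns is not verified here. All function names are hypothetical.

```python
def build_taxonomy(lineages):
    """Assign taxids to every (rank, name) pair and link each node to its parent.

    Returns (nodes, names): nodes maps taxid -> (parent_taxid, rank),
    names maps taxid -> scientific name. Taxid 1 is the root.
    """
    nodes = {1: (1, "no rank")}
    names = {1: "root"}
    ids = {}  # (rank, name) -> taxid, so shared ancestors are created once
    next_id = 2
    for lineage in lineages:
        parent = 1
        for rank, name in lineage:
            if not name:  # rank left empty for this row
                continue
            key = (rank, name)
            if key not in ids:
                ids[key] = next_id
                nodes[next_id] = (parent, rank)
                names[next_id] = name
                next_id += 1
            parent = ids[key]
    return nodes, names


def format_nodes_dmp(nodes):
    """Render nodes as 'taxid | parent | rank |' rows."""
    return "".join(f"{t}\t|\t{p}\t|\t{r}\t|\n" for t, (p, r) in sorted(nodes.items()))


def format_names_dmp(names):
    """Render names as 'taxid | name | unique name | name class |' rows."""
    return "".join(
        f"{t}\t|\t{n}\t|\t\t|\tscientific name\t|\n" for t, n in sorted(names.items())
    )


lineages = [
    [("family", "Endornaviridae"), ("genus", "Alphaendornavirus"),
     ("species", "Alphaendornavirus agarici")],
    [("family", "Endornaviridae"), ("genus", "Alphaendornavirus"),
     ("species", "Alphaendornavirus basellae")],
]
nodes, names = build_taxonomy(lineages)
print(format_nodes_dmp(nodes), end="")
print(format_names_dmp(names), end="")
```

Keying nodes on (rank, name) keeps the two species under a single shared genus and family node instead of duplicating the ancestors for every row.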

@igortru
Author

igortru commented Sep 11, 2024

Viral RefSeq was last synchronized with VMR in May 2023.
VMR.39 was released in May 2024.

@jaebeom-kim
Member

Thank you so much for sharing your DB.
I downloaded it and used it to classify SARS-CoV-2 reads, but I found that SARS-CoV-2 is not included in the DB.
Is it missing because of the discrepancy between RefSeq and VMR?

  • I'll try to build a VMR-based DB after Korean Thanksgiving.

@igortru
Author

igortru commented Sep 15, 2024

SARS-CoV-2 (severe acute respiratory syndrome coronavirus 2)

is present in VMR.39 as MN908947:NC_045512.
It has the new species name "Betacoronavirus pandemicum".

NCBI Taxonomy still reports it
https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=2697049
under the species "Severe acute respiratory syndrome-related coronavirus".

After you rebuild the Metabuli viral DB, I would suggest trying it on the ICTV challenge:
https://ictv-vbeg.github.io/ICTV-TaxonomyChallenge/

@igortru
Author

igortru commented Sep 26, 2024

VMR.39.2 has just been released.
A new "taxid" column was added; check how many isolates are not matched.
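
One reading of this check is to count the rows whose "taxid" cell is empty. A small sketch under that assumption, with the VMR sheet exported to TSV; the exact header spelling (`taxid`) and the accession header used in the demo are assumptions, and the second demo row is a placeholder rather than real VMR data.

```python
import csv
import io


def count_missing_taxid(handle, column="taxid"):
    """Return (total_rows, rows_without_taxid) for a tab-separated VMR export."""
    reader = csv.DictReader(handle, delimiter="\t")
    total = missing = 0
    for row in reader:
        total += 1
        # Treat an absent column, an empty cell, or whitespace as "not matched".
        if not (row.get(column) or "").strip():
            missing += 1
    return total, missing


# In-memory stand-in for the exported sheet; a real run would open the TSV file.
demo = io.StringIO(
    "Virus GENBANK accession\ttaxid\n"
    "MN908947\t2697049\n"
    "PLACEHOLDER_ACCESSION\t\n"
)
total, missing = count_missing_taxid(demo)
print(f"{missing} of {total} isolates have no taxid")
```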

@jaebeom-kim
Member

jaebeom-kim commented Oct 15, 2024

Hi Igor !! I finally built a virus DB using VMR.39.2
Could you try this database and give feedback?
It's available here.
https://hulk.mmseqs.com/jaebeom/vmr39.2/
(Some viruses without GenBank accessions are missing.)

Please use this version.
https://github.com/jaebeom-kim/Metabuli
The DB is not compatible with the latest release.

@igortru
Author

igortru commented Oct 15, 2024 via email

@Lelouchzhu

Betacoronavirus pandemicum

This is a good example of how ridiculous nomenclature can be. Now all sarbecoviruses are called this, and demolishing the previous species level will definitely cause more confusion than understanding.

@igortru
Author

igortru commented Oct 15, 2024

"Virus Name" is not really a taxonomic rank.

It is an assembly name or sequence name; it can be non-unique and is linked to an assembly accession (NCBI), proteome_id (UniRef), or isolate_id (ICTV).

After the introduction of VMR, the taxonomy tree should contain only officially recognized ICTV names; everything else needs to be organized in a different way, outside the taxonomy.
That requires an operational layer on top of the archival INSDC/SRA, but it looks like nobody is ready for this step yet.

As one possible solution, see the Serratus.io project, which processes the whole SRA in real time (sponsor: Amazon cloud).

@jaebeom-kim
Member

@igortru

Only question I have for now: how do you deal with bacterial genomes (accessions for prophages inside complete chromosomes) that are present in VMR?

  • included the full chromosome
  • excluded completely
  • only the prophage region included?

I used full sequences.

@igortru
Author

igortru commented Oct 22, 2024

You need to either take the location provided in VMR or exclude them completely;
otherwise you will get a lot of false-positive hits to bacterial lineages.
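
A hedged sketch of the first option, keeping only the prophage region. It assumes the location is written after the accession as `ACCESSION (start.end)` with 1-based inclusive coordinates; that notation should be checked against the actual VMR sheet before use. The helper names and the demo accession are hypothetical.

```python
import re

# Accession optionally followed by "(start.end)"; assumed notation, verify against VMR.
_LOCATION = re.compile(r"^\s*(\S+?)\s*(?:\((\d+)\.(\d+)\))?\s*$")


def parse_location(entry):
    """Return (accession, start, end); start and end are None if no region is given."""
    match = _LOCATION.match(entry)
    if match is None:
        raise ValueError(f"unrecognized accession entry: {entry!r}")
    accession, start, end = match.groups()
    if start is None:
        return accession, None, None
    return accession, int(start), int(end)


def extract_region(sequence, start, end):
    """Return the 1-based inclusive slice [start, end]; the whole sequence if no region."""
    if start is None:
        return sequence
    if not 1 <= start <= end <= len(sequence):
        raise ValueError("region lies outside the sequence")
    return sequence[start - 1:end]


# Toy example: a 10-base "chromosome" with a prophage at positions 3 to 6.
accession, start, end = parse_location("HYPOTHETICAL_ACC (3.6)")
prophage = extract_region("ACGTACGTAC", start, end)
print(accession, start, end, prophage)
```

With only the prophage slice in the reference FASTA, the flanking bacterial chromosome no longer carries a viral taxid, which is the source of the false-positive hits described above.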

3 participants