Building genbank/refseq databases from assembly_summary.txt #7

luizirber · 2020-07-17T16:01:50Z

Each subset of RefSeq and GenBank has an assembly_summary.txt file.
This is from fungi: https://ftp.ncbi.nlm.nih.gov/genomes/refseq/fungi/assembly_summary.txt
All refseq subsets: https://ftp.ncbi.nlm.nih.gov/genomes/refseq/

Benefits of using assembly_summary.txt:

We can generate good names for signatures from the columns, using assembly_accession, organism_name, infraspecific_name and asm_name. For example, for GCF_001477545.1, the name could be GCF_001477545.1 Pneumocystis carinii B80 strain=B80, Pneu_cari_B80_V3
The taxid field can be used to generate TaxInfo and save it in the Zipped SBT during indexing. Because we control both the name (instead of using --name-from-first) and how it is saved in the TaxInfo, scripts for converting results like gather_to_opal.py can be simplified.
We can distribute one database per RefSeq/GenBank subset, so people don't need to download one gigantic database for everything. Anyone who wants to use all of them can just list them all in gather or search.
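
To make the naming and taxid points above concrete, here is a rough sketch in Python. It assumes the file is tab-separated and that the column names sit on the last `#`-prefixed line before the data rows. The function names (`read_assembly_summary`, `signature_name`, `taxid_by_name`) are made up for illustration and are not part of sourmash. The name layout just copies the GCF_001477545.1 example above, and the sample row is trimmed to the columns used here, with a placeholder taxid.

```python
import io


def read_assembly_summary(fp):
    """Yield one dict per assembly from an open assembly_summary.txt file."""
    header = None
    for line in fp:
        line = line.rstrip("\r\n")
        if not line:
            continue
        if line.startswith("#"):
            # Assumption: the last comment line before the data names the columns.
            header = line.lstrip("#").strip().split("\t")
            continue
        if header is None:
            raise ValueError("no '#'-prefixed header line found before data rows")
        yield dict(zip(header, line.split("\t")))


def signature_name(row):
    """Build 'accession organism infraspecific, asm_name', skipping empty parts."""
    parts = [row["assembly_accession"], row["organism_name"]]
    if row.get("infraspecific_name"):
        parts.append(row["infraspecific_name"])
    name = " ".join(parts)
    if row.get("asm_name"):
        name += ", " + row["asm_name"]
    return name


def taxid_by_name(rows):
    """Map each generated signature name to its integer taxid."""
    return {signature_name(row): int(row["taxid"]) for row in rows}


# Trimmed, in-memory sample; the taxid value 0 is a placeholder, not real data.
SAMPLE = (
    "# assembly_accession\ttaxid\torganism_name\tinfraspecific_name\tasm_name\n"
    "GCF_001477545.1\t0\tPneumocystis carinii B80\tstrain=B80\tPneu_cari_B80_V3\n"
)

for row in read_assembly_summary(io.StringIO(SAMPLE)):
    print(signature_name(row))
```

Because the name and the taxid come from the same row, a downstream script can look the taxid up by name instead of re-parsing FASTA headers.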

More info: https://ftp.ncbi.nlm.nih.gov/pub/factsheets/HowTo_Downloading_Genomic_Data.pdf

ctb · 2022-04-02T13:13:12Z

Much belated update that I think finally resolves this -

The ncbi-assemblies examples in https://github.com/sourmash-bio/database-examples show how to use assembly_summary.txt to generate information in the new fromfile format merged in sourmash-bio/sourmash#1885.
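
For readers who haven't looked at the examples repo, a minimal sketch of the idea in Python follows. It is not the code from database-examples. It assumes the fromfile CSV uses the columns `name`, `genome_filename` and `protein_filename`, and that an empty `protein_filename` is acceptable when only a genome is available. The helper `write_fromfile_csv` and the local `genomes/` path layout are hypothetical.

```python
import csv
import io


def write_fromfile_csv(rows, out, genome_dir="genomes"):
    """Write one CSV line per assembly row (dicts keyed by assembly_summary columns)."""
    writer = csv.DictWriter(
        out, fieldnames=["name", "genome_filename", "protein_filename"]
    )
    writer.writeheader()
    for row in rows:
        acc = row["assembly_accession"]
        writer.writerow(
            {
                "name": f"{acc} {row['organism_name']}",
                # Hypothetical local path; adjust to wherever the genomes were downloaded.
                "genome_filename": f"{genome_dir}/{acc}_genomic.fna.gz",
                # Left empty here on the assumption that no proteome was downloaded.
                "protein_filename": "",
            }
        )


# Example with a single row taken from the issue text.
buf = io.StringIO()
write_fromfile_csv(
    [{"assembly_accession": "GCF_001477545.1",
      "organism_name": "Pneumocystis carinii B80"}],
    buf,
)
print(buf.getvalue())
```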

Moreover, I'm 99.9% sure that wort uses the same information as in assembly_summary.txt to name the signatures it generates automatically.

closing!

This was referenced Jul 17, 2020

build a streamlined Genbank/RefSeq based on GTDB 25k genomes #8

Closed

what should our signature names look like for databases? #9

Closed

This was referenced Jul 18, 2020

upgrade sourmash_databases soon (for sourmash 4.0) #10

Closed

can we / should we integrate genomepy? sourmash-bio/sourmash#775

Closed

[MRG] Building databases from assembly_summary.txt #11

Closed

ctb mentioned this issue Dec 28, 2020

building large LCA databases for genbank subsets sourmash-bio/sourmash#1264

Closed

This was referenced Jul 14, 2021

add more input formats to sourmash taxonomy prepare sourmash-bio/sourmash#1669

Open

rework database construction and release process to use manifests sourmash-bio/sourmash#1652

Closed

ctb mentioned this issue Apr 1, 2022

[MRG] implement sourmash sketch fromfile sourmash-bio/sourmash#1885

Merged

12 tasks

ctb closed this as completed Apr 2, 2022

ctb mentioned this issue May 1, 2022

sourmash database construction - current status and future thoughts sourmash-bio/sourmash#2015

Open
